COMPREHENSIVE DEEP DIVE ON ROAD ACCIDENTS IN THE UK (2022)
Author
Alexander Junod, Camille Leroy, Aurelien Urfer
Published
December 22, 2023
1. Introduction
1.1 Overview and Motivation
Road accidents remain one of the leading causes of death and hospitalization around the world. Factors such as weather, driver behavior, and demographics contribute to a substantial number of deaths worldwide each year (1.3M on average, according to research conducted by the WHO) (World Health Organization, 2022). In 2018 alone, it was estimated that a road death occurred every 22 seconds (World Health Organization, 2018). In the UK in 2022, a fatal road accident occurred every 33 minutes on average. Road accidents are complex events and provide a rich base of information on which we can put our data science toolkit to work, uncovering patterns and insights into what influences road safety.
The motivation behind our choice of project (UK Road Accident Analysis) is deeply rooted in the personal experiences of our group members. Witnessing the life-altering consequences of road accidents firsthand ignited our interest in understanding the factors that contribute to them. This emotional connection, combined with the fact that all our team members had to complete the “Cours-2-phases” classes that are mandatory for our licenses and that explain the impacts of road accidents, created a deep curiosity about the subject.
Beyond the emotional and curiosity realm, our team recognizes that our selected data sets offer a wealth of tangible data, enabling us to explore various data science domains. This project will facilitate the enhancement of our descriptive, regression, and spatial-temporal analytical skills, empowering us to discern patterns and relationships within the data.
1.2 Objectives
Our project centered on understanding road accident patterns within the UK for the year 2022. We divided our analysis into four key components to understand their individual impacts on road accidents: spatial analysis to identify and understand accident hotspots, temporal trends, demographics, and vehicle characteristics, each examined in relation to accident severity. This data science project served as an opportunity for us to apply our newly acquired skills in exploratory data analysis, data analysis, and statistical modeling to real-world data, aiming to contribute to a better understanding of road safety in the UK for the year 2022.
1.3 Related Work
Various sources served as inspirations for our research project on UK road accidents in 2022.
In our view, Vox News stood out as a leader in this field, excelling in the art of presenting data to facilitate storytelling and deliver valuable insights to their audience. We were inspired by the way they craft their videos, articles, and website, which we believe are exemplary in providing informative and engaging content.
We also drew inspiration for the themes to address in our study from the official UK government website. This authoritative source provided us with essential data and insights related to road safety and accidents in the UK, ensuring that our research is firmly grounded in government statistics and policies.
1.4 Research questions
We began our project with a primary focus on investigating spatial, temporal, demographic, and vehicle trends in 2022. However, as we dived deeper into our research (especially after our exploratory data analysis), we recognized the opportunity to add another dimension to our analysis by examining how these factors were associated with the severity of accidents. Therefore, our research questions had a dual purpose: firstly, to gain insights into the fundamental aspects of accidents, and secondly, to explore their impact on accident severity. This approach introduced added complexity to our analysis, but we believe it yielded even more intriguing insights from our complex datasets.
a) Can we identify locations, both geographical (coordinate-based) and spatial (urban vs. rural, highway vs. one-way road), that are more prone to accidents?
b) What temporal trends (time of day/month/week, seasonality) can we identify in road accidents in the UK? Can we identify the most dangerous times to be on the road?
c) What are the most prevalent demographics (age, gender) and vehicle characteristics (engine size, electric vs. petrol, etc.) in road accidents?
d) Can we predict the severity of an accident using significant variables in our dataset and validate its accuracy?
2. Data
2.1 Sources
We extracted our Road Safety Data from the UK Government’s data sets for 2022, focusing specifically on road incidents. To answer our four research questions, we used four main data sets, each providing essential details about 2022 UK road accidents.
The Collision, Vehicle, and Casualty datasets each contain an “accident_index” that links them together. Furthermore, the casualty and vehicle datasets both contain a reference (vehicle_reference, plus casualty_reference in the casualty dataset) that is assigned to each vehicle or casualty involved in a collision. This allows for more granular analysis, such as identifying which casualties were in which vehicle within a given accident. The relationship between these three datasets can be seen in the graphic below.
In the graphic above, we can observe the inter-relationship between our datasets. Each row in the collision dataset (henceforth called the accident dataset for simplicity) represents a single accident. Each row in the vehicle dataset represents a unique vehicle associated with a collision (note that a single accident can involve multiple vehicles). In the casualty dataset, each row represents an individual casualty. The vehicle reference appears in both the vehicle and casualty datasets, allowing us to identify which casualty was in which vehicle and in which accident.
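The linkage described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the three small data frames and all of their values are invented, and only the key column names (accident_index, vehicle_reference, casualty_reference) come from the datasets.

```r
# Toy data only: every value below is invented for illustration.
collisions <- data.frame(
  accident_index    = c("A1", "A2"),
  accident_severity = c(3, 2)
)
vehicles <- data.frame(
  accident_index    = c("A1", "A1", "A2"),
  vehicle_reference = c(1, 2, 1),
  vehicle_type      = c(9, 1, 9)
)
casualties <- data.frame(
  accident_index     = c("A1", "A1", "A2"),
  vehicle_reference  = c(1, 2, 1),
  casualty_reference = c(1, 2, 1),
  casualty_severity  = c(3, 3, 2)
)

# Attach each casualty to its vehicle using the two-column key,
# then attach the collision-level information using accident_index.
casualty_vehicle <- merge(casualties, vehicles,
                          by = c("accident_index", "vehicle_reference"),
                          all.x = TRUE)
linked <- merge(casualty_vehicle, collisions,
                by = "accident_index", all.x = TRUE)
print(linked)
```

Each row of `linked` is still one casualty, now carrying the attributes of the vehicle it was in and of the collision it belongs to.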
2.2 Description of Datasets
DS 1: Collision Statistics
Description:
This data set encompasses detailed information on each road accident in the UK. It includes a comprehensive range of features such as the location of the accident, the time it occurred, the severity of the collision, and the type of road involved. This data set is crucial as it provides a holistic view of all road collisions that occurred in the specified period.
Code
# This code imports the accident_df from the file directory - which is the file where we are keeping the document.
accident_data_path <- here::here("data", "dft-road-casualty-statistics-collision-2022.csv")
accident_df <- read.csv(accident_data_path)

# Here we are creating a descriptive data frame for the collision dataset
accident_columns <- data.frame(
  "Name" = names(accident_df),
  "Type" = sapply(accident_df, function(column) class(column)[1]),
  "Example" = sapply(accident_df, function(column) column[1]),
  "Explanation" = c(
    "A unique value for each accident. It combines the accident_year and accident_reference to form a unique ID. It can be used to join with the Vehicle and Casualty datasets.",
    "The year in which the accident occurred.",
    "An ID used by the police to reference a collision within a specific year. It is not unique outside of the year, so accident_index should be used for linking to other years.",
    "The easting coordinate of the accident location in OSGB36 National Grid format. It may be null if the location is not known.",
    "The northing coordinate of the accident location in OSGB36 National Grid format. It may be null if the location is not known.",
    "The longitude coordinate of the accident location. It may be null if the location is not known.",
    "The latitude coordinate of the accident location. It may be null if the location is not known.",
    "The police force responsible for the area where the accident occurred. It is represented as a numerical code and corresponds to different police forces.",
    "The severity of the accident, categorized as: 1: Fatal 2: Serious 3: Slight",
    "The number of vehicles involved in the accident.",
    "The total number of casualties (injuries or fatalities) in the accident.",
    "The date of the accident in DD/MM/YYYY format.",
    "The day of the week when the accident occurred, categorized as: 1: Sunday 2: Monday 3: Tuesday 4: Wednesday 5: Thursday 6: Friday 7: Saturday",
    "The time at which the accident occurred, represented in hours and minutes.",
    "The local authority district where the accident occurred. It is represented as a numerical code corresponding to different districts.",
    "The local authority district in ONS (Office for National Statistics) code format where the accident occurred.",
    "The local authority responsible for the highway where the accident occurred.",
    "The classification of the first road involved in the accident, categorized as: 1: Motorway 2: A(M) 3: A 4: B 5: C 6: Unclassified -1: Data missing or out of range",
    "The number of the first road involved in the accident, or unknown if not available. It depends on the road class.",
    "The type of road where the accident occurred.",
    "The speed limit on the road where the accident occurred. Valid values are 20, 30, 40, 50, 60, or 70. Other values represent data missing or out of range.",
    "Details about the type of junction where the accident occurred.",
    "Information about the control at the junction where the accident occurred.",
    "Similar to first_road_class, but for the second road involved in the accident.",
    "Similar to first_road_number, but for the second road involved in the accident.",
    "Information about pedestrian crossing control within 50 meters of the accident, categorized as: 0: None within 50 meters 1: Control by school crossing patrol 2: Control by other authorized person",
    "Information about pedestrian crossing physical facilities within 50 meters of the accident.",
    "The light conditions at the time of the accident.",
    "The weather conditions at the time of the accident.",
    "The road surface conditions at the time of the accident",
    "Special conditions at the accident site",
    "Hazards on the carriageway at the accident site.",
    "Indicates whether the accident occurred in an urban or rural area.",
    "Indicates whether a police officer attended the scene of the accident.",
    "Indicates whether the road is a trunk road managed by Highways England or non-trunk.",
    "For England and Wales only, this field provides information about the Lower Layer Super Output Area (LSOA) of the accident location."
  )
)

# This code is executing the table that we had created previously, we took off row names, and added a title, the class is in reference to the visual format we want.
datatable(accident_columns, rownames = FALSE, caption = "Collision Dataset Variables", class = 'cell-border stripe')
Key Features:
Includes 36 different variables.
Contains 106004 observations.
Relevance:
Used for research questions 1-4
DS 2: Vehicle Statistics
Description:
This data set offers detailed information concerning all vehicles involved in collisions and their drivers within the UK in 2022. It encompasses a variety of data points, including but not limited to, the type of vehicle involved, its engine capacity, etc. This dataset is instrumental in understanding the impacts of specific vehicle characteristics on road collisions in the UK.
Code
# For documentation on this cell block, please refer to the previous one (accident_df)
vehicle_data_path <- here::here("data", "dft-road-casualty-statistics-vehicle-2022.csv")
vehicle_df <- read.csv(vehicle_data_path)

vehicle_columns <- data.frame(
  "Name" = names(vehicle_df),
  "Type" = sapply(vehicle_df, function(column) class(column)[1]),
  "Example" = sapply(vehicle_df, function(column) column[1]),
  "Explanation" = c(
    "A unique value for each accident. It combines the accident_year and accident_reference to form a unique ID. It can be used to join with the Vehicle and Casualty datasets.",
    "The year in which the accident occurred.",
    "An ID used by the police to reference a collision within a specific year. It is not unique outside of the year, so accident_index should be used for linking to other years.",
    "An ID assigned to each vehicle involved in an accident within the same collision.",
    "The type of vehicle involved in the accident. See code/format for vehicle type mapping.",
    "Indicates whether the vehicle was towing or articulated in some way. See code/format for towing and articulation mapping.",
    "Describes the manoeuvre of the vehicle before the accident. See code/format for vehicle manoeuvre mapping.",
    "The direction from which the vehicle was traveling before the accident. See code/format for vehicle direction mapping.",
    "The direction to which the vehicle was traveling before the accident. See code/format for vehicle direction mapping.",
    "The location of the vehicle on the road, including restricted lanes. See code/format for vehicle location mapping.",
    "Indicates whether the vehicle skidded or overturned during the accident. See code/format for skidding and overturning mapping.",
    "Indicates if the vehicle hit an object in the carriageway during the accident. See code/format for hit object in carriageway mapping.",
    "Indicates if the vehicle left the carriageway during the accident. See code/format for vehicle leaving carriageway mapping.",
    "Describes the first point of impact on the vehicle. See code/format for first point of impact mapping.",
    "Indicates whether the vehicle is left-hand drive or not. See code/format for left-hand drive mapping.",
    "The purpose of the driver's journey. See code/format for journey purpose of driver mapping.",
    "The sex of the driver. See code/format for sex of driver mapping.",
    "The age of the driver.",
    "The age band of the driver. See code/format for age band of driver mapping.",
    "The engine capacity of the vehicle in cubic centimeters (cc).",
    "The propulsion code of the vehicle. See code/format for propulsion code mapping.",
    "The age of the vehicle.",
    "The make and model of the vehicle.",
    "The IMD (Index of Multiple Deprivation) decile of the driver's residence. See code/format for IMD decile mapping.",
    "The type of driver's home area. See code/format for driver home area type mapping.",
    "The LSOA (Lower Layer Super Output Area) of the driver's residence.",
    "An ID used to reference a collision within a specific year.",
    "The location of the junction where the accident occurred."
  )
)

datatable(vehicle_columns, rownames = FALSE, caption = "Vehicle Dataset Variables", class = 'cell-border stripe')
Key Features:
Includes 28 distinct variables.
Contains 193545 observations.
Relevance:
Used for research questions 3 & 4
DS 3: Casualty Statistics
Description:
This dataset delivers in-depth insights into casualties resulting from road accidents in the UK during 2022. It contains a broad spectrum of variables, including but not limited to, casualty age, gender, the severity of their injuries, type of casualty (such as driver/passenger or pedestrian). This dataset is essential for a detailed analysis on individual’s demographics and their impact on road accidents.
Code
# For documentation on this cell block, please refer to the previous one (accident_df)
casualty_data_path <- here::here("data", "dft-road-casualty-statistics-casualty-2022.csv")
casualty_df <- read.csv(casualty_data_path)

explanations <- c(
  "A unique value for each accident. It combines the accident_year and accident_reference to form a unique ID. It can be used to join with the Vehicle and Casualty datasets.",
  "The year in which the accident occurred.",
  "An ID used by the police to reference a collision within a specific year. It is not unique outside of the year, so accident_index should be used for linking to other years.",
  "An ID assigned to each vehicle involved in an accident within the same collision.",
  "An ID used to reference a casualty within a specific accident.",
  "The class of the casualty. See code/format for casualty class mapping.",
  "The gender of the casualty. See code/format for gender mapping.",
  "The age of the casualty.",
  "The age band of the casualty. See code/format for age band mapping.",
  "The severity of the casualty. See code/format for casualty severity mapping.",
  "The location of the pedestrian during the accident. See code/format for pedestrian location mapping.",
  "The movement of the pedestrian during the accident. See code/format for pedestrian movement mapping.",
  "Indicates whether the casualty was a car passenger.",
  "Indicates whether the casualty was a bus or coach passenger.",
  "Indicates whether the casualty was a pedestrian road maintenance worker.",
  "The type of casualty. See code/format for casualty type mapping.",
  "The home area type of the casualty. See code/format for home area type mapping.",
  "The IMD (Index of Multiple Deprivation) decile of the casualty's residence. See code/format for IMD decile mapping.",
  "The Lower Layer Super Output Area (LSOA) of the casualty's residence."
)

casualty_columns <- data.frame(
  "Name" = names(casualty_df),
  "Type" = sapply(casualty_df, function(column) class(column)[1]),
  "Example" = sapply(casualty_df, function(column) ifelse(length(column) > 0, as.character(column[1]), NA)),
  "Explanation" = explanations
)

datatable(casualty_columns, rownames = FALSE, caption = "Casualty Dataset Variables", class = 'cell-border stripe', options = list(pageLength = 10))
Key Features:
Includes 19 distinct variables.
Contains a total of 135480 observations.
Relevance:
Used for research questions 3 & 4
DS 4: Legend
Description:
This dataset provides the legends to the fields present in our three accident data sets. It is a key resource for understanding and interpreting the data, especially in identifying and decoding missing values. It includes detailed information on what specific values signify a “missing value” or “other” in each column, thereby facilitating accurate data analysis.
| table (dataset) | field_name | code/format | label |
|---|---|---|---|
| Vehicle | propulsion_code | 1 | Petrol |
| Vehicle | propulsion_code | 2 | Heavy oil |
| Vehicle | propulsion_code | 3 | Electric |

Example of Legend dataset
Key Features:
Consists of one large table
Relevance:
Allows understanding of data sets 1-3.
2.3 Data Preparation
2.3.1 Joining Datasets
One of the primary challenges associated with the data provided by the UK Government was its format. Combining all information from the three datasets into a single dataset would have been a daunting task, necessitating the sacrifice of critical details. A notable limitation of this dataset-splitting approach used by the UK Government was the absence of granular data on casualties and vehicles within the accident database, as well as the unavailability of vehicle information in the casualty database. This limitation became apparent when we encountered difficulties in obtaining more detailed information about specific accidents. Therefore, in the future, it would be advisable to procure a dataset that consolidates accident, casualty, and vehicle characteristics into a single, unified dataset. This approach would help eliminate potential biases and facilitate seamless cross-referencing across all three data sets.
Examples of Injuries per Severity

| Severity Classification | Injuries Sustained |
|---|---|
| Fatal | Deceased |
| Serious | Broken neck or back; Severe head injury, unconscious; Severe chest injury, any difficulty breathing; Internal injuries; Multiple severe injuries, unconscious; Loss of arm or leg (or part); Fractured pelvis or upper leg; Deep penetrating wound |
| Slight | Whiplash or neck pain; Shallow cuts / lacerations / abrasions; Sprains and strains; Bruising; Shock |
To address these issues to the best of our abilities, we took specific steps before proceeding with our analysis to limit any potential biases. We aimed to introduce more detailed information on severity and vehicle types across all datasets (see the table above for details on each severity level). Initially, the accident database contained an “accident_severity” column that indicated only the worst severity recorded in the accident. For instance, if an accident resulted in one fatality and 16 slight injuries, the column would display “Fatal.” This representation was not suitable for our analysis, as relying on it alone would bias statistical tests and regressions. Therefore, through data wrangling, we calculated and added three new columns to both the accident and vehicle databases. These columns provide the counts of fatal, serious, and slight casualties, offering a more nuanced view of accident severity.
Furthermore, we conducted a similar data wrangling process to enhance the accident database by adding columns that indicate the number of cars, motorcycles, trucks, cyclists, and “Other” vehicles involved in each accident. This data integration and transformation process significantly improved the quality and granularity of our dataset. Once this integration was complete, we could begin identifying any missing values that needed to be addressed.
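The severity counts described above can be sketched as follows. This is a minimal base R illustration under stated assumptions, not the project's actual wrangling code: the toy data frames and their values are invented, while the column names num_fatal, num_serious, and num_slight and the 1/2/3 severity codes follow the datasets.

```r
# Toy data only: one row per casualty, values invented for illustration.
casualties <- data.frame(
  accident_index    = c("A1", "A1", "A1", "A2"),
  casualty_severity = c(1, 3, 3, 2)  # 1 = Fatal, 2 = Serious, 3 = Slight
)
accidents <- data.frame(accident_index = c("A1", "A2", "A3"))

# Label the severity codes, then cross-tabulate accident against severity.
severity <- factor(casualties$casualty_severity,
                   levels = c(1, 2, 3),
                   labels = c("num_fatal", "num_serious", "num_slight"))
counts <- as.data.frame.matrix(table(casualties$accident_index, severity))
counts$accident_index <- rownames(counts)

# Add the three count columns to the accident-level data.
accidents <- merge(accidents, counts, by = "accident_index", all.x = TRUE)

# Accidents without any matching casualty rows get a count of zero.
count_cols <- c("num_fatal", "num_serious", "num_slight")
accidents[count_cols][is.na(accidents[count_cols])] <- 0
print(accidents)
```

The same pattern (label the code, cross-tabulate per accident, merge back) would apply to counting vehicles per category, using a vehicle category in place of the casualty severity.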
Example process of merging Casualty and Collision Datasets together
2.3.2 Missingness of Data
Our initial examination of the datasets revealed a lack of uniformity in the indicators used for missing values. Rather than a single standard marker, different codes, specific to each variable, signified the absence of data. This inconsistency introduced an additional challenge: to assess missing values accurately, we had to identify and correctly interpret these codes for each variable. For example, the three variables below each use different codes to denote missing information.

Example of missing values per different columns

| Variable Name | Missing Value Character |
|---|---|
| junction_detail | -1 or 99 |
| junction_control | -1 or 9 |
| second_road_number | -1 or 0 |
We therefore established a “missing values dictionary” that holds the missing-value codes for all variables in our three data sets, namely Accident Statistics, Casualty Statistics, and Vehicle Statistics. We then used this dictionary, through a dedicated function, to identify these codes and replace them with NA in each column across all three datasets.
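A dictionary-based replacement of this kind could look like the sketch below. It is an assumption-laden illustration: the helper name replace_missing, the list name missing_codes, and the toy data are hypothetical, and only the three code pairs come from the example table above.

```r
# Codes from the example table above; a full dictionary would cover every column.
missing_codes <- list(
  junction_detail    = c(-1, 99),
  junction_control   = c(-1, 9),
  second_road_number = c(-1, 0)
)

# Hypothetical helper: replace each column's sentinel codes with NA.
replace_missing <- function(df, dictionary) {
  for (column in intersect(names(dictionary), names(df))) {
    df[[column]][df[[column]] %in% dictionary[[column]]] <- NA
  }
  df
}

# Toy data only: values invented for illustration.
toy <- data.frame(
  junction_detail    = c(3, 99, -1),
  junction_control   = c(9, 4, 2),
  second_road_number = c(0, 25, -1)
)
cleaned <- replace_missing(toy, missing_codes)
print(cleaned)
```

Keeping the codes in one list means the same function can be applied to the accident, vehicle, and casualty data frames, and a value such as 9 is only treated as missing in the columns where the legend says so.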
The visualizations presented below illustrate the distribution of missing values within our datasets. We were able to accomplish this through the utilization of the visdat package, which provided us with insights into the extent of missing data within our data frames.
We noticed that the accident and vehicle datasets had the highest proportions of missing values: 713720 individual missing values (11.17%) and 338314 individual missing values (7.25%) respectively. Our casualty dataset was less affected, with 61976 missing values (2.29%). (N.B. we also checked for duplicate rows and found none in the datasets.)
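Counts and percentages like those above can be obtained with a few base R calls once the missing-value codes have been converted to NA. The sketch below uses an invented toy data frame, so its numbers are unrelated to the figures reported for the real datasets.

```r
# Toy data only: values invented for illustration.
toy <- data.frame(
  speed_limit = c(30, NA, 60, 30),
  road_type   = c(6, 6, NA, 6),
  weather     = c(1, 1, 2, 1)
)

n_missing    <- sum(is.na(toy))            # number of individual missing cells
pct_missing  <- 100 * mean(is.na(toy))     # share of all cells that are missing
per_column   <- 100 * colMeans(is.na(toy)) # share missing within each column
n_duplicates <- sum(duplicated(toy))       # rows identical to an earlier row

print(n_missing)
print(round(pct_missing, 2))
print(per_column)
print(n_duplicates)
```

The per-column breakdown is what makes a fully missing column stand out, since it shows up as 100%.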
Note
It is important to note that the column local_authority_district is completely missing, therefore we will go ahead and delete this column directly.
This discovery led us to realize that the extent of missing values across various variables was greater than initially anticipated. Considering that not every variable across every dataset was essential for each research question, we decided to construct smaller, question-specific data sets. This approach enabled us to strategically remove rows with missing values from these focused data sets (rather than the original data sets), based on their relevance to the specific research question. By doing so, we selectively eliminated missing data, only from variables critical to a particular question, thereby minimizing overall data loss.
2.3.3 Formatting our Data
Using the visdat package again, we visualized both the missingness and the class of each column (vis_miss shows the missingness, while vis_dat also shows the class of each column). Through this tool we noticed that many columns in our datasets were stored as character or integer. We corrected this by converting columns such as date, time, latitude, and longitude into their appropriate data types. For example, the latitude and longitude columns were converted to numeric.
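As an illustration of these conversions, the base R sketch below parses a date in the DD/MM/YYYY format used by the collision data and converts coordinates stored as text to numeric. The toy rows are invented, and the derived hour column is an assumption about how the time field might be used later.

```r
# Toy data only: every column is stored as character, as it might be after import.
raw <- data.frame(
  date      = c("01/01/2022", "15/07/2022"),
  time      = c("08:45", "17:05"),
  latitude  = c("51.5074", "53.4808"),
  longitude = c("-0.1278", "-2.2426"),
  stringsAsFactors = FALSE
)

# The format string must match the DD/MM/YYYY layout of the source column.
raw$date      <- as.Date(raw$date, format = "%d/%m/%Y")
raw$latitude  <- as.numeric(raw$latitude)
raw$longitude <- as.numeric(raw$longitude)

# Hour of day extracted from the first two characters of "HH:MM".
raw$hour <- as.integer(substr(raw$time, 1, 2))

str(raw)
```

Stating the date format explicitly matters here: a parser that assumes year-month-day order would silently return NA for day-first strings.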
2.3.4 Creating mini datasets for each research question
As described in Section 2.3.2, the extent of missing values within our three datasets was greater than initially anticipated, and not every data set and variable was essential for each research question. We therefore constructed smaller, question-specific data sets and removed rows with missing values only from the variables critical to each question, thereby minimizing overall data loss. Despite the added complexity, we determined that this was the right choice moving forward. Missing values are often handled through imputation instead, for example by substituting the median. However, because most of our variables are categorical codes describing accidents, a median is not meaningful for them. We therefore determined that our best course of action was to delete these missing values.
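The benefit of dropping incomplete rows per question rather than globally can be seen in a small base R sketch. The helper make_subset and the toy data are hypothetical and serve only to illustrate the idea.

```r
# Toy data only: each row has a gap in a different column.
toy <- data.frame(
  accident_index     = c("A1", "A2", "A3", "A4"),
  latitude           = c(51.5, NA, 53.4, 52.1),
  day_of_week        = c(1, 2, NA, 4),
  engine_capacity_cc = c(NA, NA, 1600, NA)
)

# Hypothetical helper: keep only the requested columns, then drop rows
# that are incomplete within those columns.
make_subset <- function(df, columns) {
  subset_df <- df[, columns, drop = FALSE]
  subset_df[complete.cases(subset_df), , drop = FALSE]
}

q_spatial   <- make_subset(toy, c("accident_index", "latitude"))
q_temporal  <- make_subset(toy, c("accident_index", "day_of_week"))
all_at_once <- toy[complete.cases(toy), ]

print(nrow(q_spatial))
print(nrow(q_temporal))
print(nrow(all_at_once))
```

In this toy case each question-specific subset keeps three of the four rows, whereas requiring completeness across every column at once keeps none.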
Research Question 1 Dataset:
Code
# This code manually selects the variables that we wanted to keep from the accident_df to be put in our dataset that we will be continuing with. If you'd like to add a variable to the dataset which you'd like to work with later, make sure you put it here. My suggestion is to import the legend dataset, which will explain the meaning of the variable and what each individual code that it has means.
# Note: "special_conditions_at_site" is listed once; listing a column twice would create a duplicated column.
df1var <- c("accident_index", "accident_reference", "longitude", "latitude",
            "location_easting_osgr", "location_northing_osgr", "lsoa_of_accident_location",
            "urban_or_rural_area", "road_type", "first_road_class", "second_road_class",
            "accident_severity", "num_fatal", "num_serious", "num_slight",
            "weather_conditions", "road_surface_conditions", "special_conditions_at_site",
            "number_of_vehicles", "number_of_casualties", "speed_limit", "date", "time",
            "light_conditions", "Motorcycle", "Trucks", "Car", "Other", "Cyclist")
q1_clean <- accident_df[, df1var]
For the first dataset, we selected all relevant variables for Research Question 1, which aims to identify why certain locations were more prone to accidents through spatial analysis. We therefore included geographical (location-specific) factors such as latitude and longitude, as well as spatial characteristics such as road type, urban vs. rural area, road surface, and weather conditions. The number of missing values in this dataset was 17146 (0.54%), and they were deleted.
In the second dataset (Research Question 2), we selected all potentially relevant variables for temporal analysis, such as date, time, day of the week, and accident severity, to explore temporal patterns and trends. The number of missing values in this dataset was 7381 (0.44%), and they were deleted.
In the third dataset (Research Question 3, Part A), we selected all potentially relevant variables for our demographic analysis, including age, gender, and the Index of Multiple Deprivation (a measure of deprivation within the UK). The number of missing values in this dataset was 40166 (1.74%), and they were deleted.
In the third dataset (Research Question 3, Part B), we selected all relevant variables for vehicle analysis, including whether the vehicle was left-hand drive, the generic brand, the age of the vehicle, etc. The number of missing values in this dataset was 229518 (8.47%); these were deleted, except for those attributable to bicycles. Bicycles have no motor, so fields such as engine capacity are necessarily empty for them. We treat these values as MNAR (missing not at random): we keep them and simply filter out bicycles when an analysis requires motor-related fields. This is why some missing values remain in this dataset.
| accident_index | date | accident_reference | vehicle_type | skidding_and_overturning | vehicle_left_hand_drive | engine_capacity_cc | propulsion_code | age_of_vehicle | lsoa_of_driver | num_fatal | num_serious | num_slight | sex_of_driver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022010352601 | 01/01/2022 | 010352601 | 1 | 0 | 1 | NA | NA | NA | NA | 0 | 0 | 1 | 1 |
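The handling of bicycle rows described above can be sketched as follows. This is a base R illustration on invented data, assuming vehicle_type code 1 denotes a pedal cycle and code 9 a car, in line with the vehicle category mapping used for the regression dataset.

```r
# Toy data only: values invented for illustration.
toy <- data.frame(
  vehicle_type       = c(1, 9, 9, 1),  # 1 = pedal cycle, 9 = car (assumed codes)
  engine_capacity_cc = c(NA, 1600, NA, NA),
  age_of_vehicle     = c(NA, 5, 3, NA)
)

motor_cols <- c("engine_capacity_cc", "age_of_vehicle")
is_cycle   <- toy$vehicle_type == 1

# Empty motor fields on a bicycle are expected; on a car they are genuine gaps.
genuine_gap <- !is_cycle & !complete.cases(toy[, motor_cols])

# Rows kept in the dataset: bicycles stay even though their motor fields are NA.
kept <- toy[!genuine_gap, ]

# For an analysis that needs motor fields, bicycles are filtered out at that point.
motor_vehicles <- kept[kept$vehicle_type != 1, ]

print(sum(genuine_gap))
print(nrow(kept))
print(motor_vehicles)
```

Separating the two kinds of NA in this way avoids discarding every bicycle merely because it has no engine.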
Research Question 4 Dataset:
The fourth dataset will be created after analyzing datasets 1 to 3, enabling us to determine the most significant variables for our regression analysis.
Code
# Find the columns that are common to both the accident and vehicle data
common_columns <- intersect(names(accident_df), names(vehicle_df))
print(common_columns) # Print them to see what is shared
#> [1] "accident_index"     "accident_year"      "accident_reference"
#> [4] "date"               "num_fatal"          "num_serious"
#> [7] "num_slight"

# Create a unique vehicle data frame by dropping the common columns except 'accident_index' and 'vehicle_reference'
vehicle_df_unique <- vehicle_df %>%
  select(-all_of(common_columns[common_columns != "accident_index" & common_columns != "vehicle_reference"]))

# Join the data frames:
# first join casualty with accident data,
# then add the unique vehicle data to it
logit_data <- casualty_df %>%
  left_join(accident_df, by = "accident_index") %>%
  left_join(vehicle_df_unique, by = c("accident_index", "vehicle_reference"))

# Remove duplicated columns that end with .x and .y
logit_data <- logit_data %>%
  select(-matches("\\.x$"), -matches("\\.y$"))

# Further data transformation
logit_data <- logit_data %>%
  mutate(
    # Convert date to Date format, if it is not already
    date = ymd(date, quiet = TRUE),
    # Create a new column for vehicle categories based on vehicle types
    vehicle_category = case_when(
      vehicle_type %in% c(1) ~ "Cyclist",
      vehicle_type %in% c(2, 3, 4, 5, 23, 97, 103, 104, 105, 106) ~ "Motorcycle",
      vehicle_type %in% c(8, 9, 108, 109) ~ "Car",
      vehicle_type %in% c(19, 20, 21, 98, 113) ~ "Trucks",
      TRUE ~ "Other"
    )
  ) %>%
  # Keep only certain categories of vehicles and casualties
  filter(vehicle_category %in% c("Cyclist", "Motorcycle", "Car", "Trucks")) %>%
  filter(casualty_class %in% c(1, 2)) %>%
  # Extract year, month, day, and hour from the date and time
  mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    hour = as.numeric(substr(time, 1, 2))
  ) %>%
  # Select specific columns for the final dataset
  select(
    accident_index, year, month, day, hour, accident_severity,
    number_of_vehicles, number_of_casualties, road_type, speed_limit,
    junction_detail, junction_control, light_conditions, propulsion_code,
    weather_conditions, road_surface_conditions, urban_or_rural_area,
    vehicle_category, age_of_vehicle, engine_capacity_cc, car_passenger,
    first_point_of_impact, vehicle_left_hand_drive, driver_imd_decile,
    hit_object_off_carriageway
  )
Code
# Drop every row that still contains a missing value, then preview the first row
logit_data <- na.omit(logit_data)
kable(head(logit_data, n = 1))
We used the plot_intro function from the DataExplorer package to visually inspect our four mini datasets and ensure that they were clean before proceeding with the data wrangling phase of our project. As the outputs above show, none of our four datasets has any missing columns or missing observations. We can therefore move forward to the next step.
2.3.5 Feature Engineering
To improve the clarity and interpretability of our dataset, we engaged in a process of feature transformation. This involved converting various numerical variables into categorical variables, a technique often referred to as categorical encoding. This method is distinct from ‘one-hot encoding,’ which is a specific approach used mainly in preparing data for machine learning models. In our case, the transformation primarily entailed assigning descriptive labels to numerical codes, thereby enhancing the readability and comprehension of the data. This transformation was particularly beneficial for visual data analysis and interpretation. It allowed us to represent data elements like days of the week, accident severity, and time ranges with meaningful labels instead of mere numerical codes. On top of this, given the large number of encoded values in our datasets, such as vehicle types or road types, incorporating this level of granularity became paramount. It served as a crucial aid for readers, as comprehending the data would have been exceptionally challenging otherwise. This enhancement not only rendered our graphs more intuitive but also made data tables more comprehensible, eliminating the need for supplementary labeling or explanations in our visual representations.
Code
# Add day names to day_of_week, replacing the numeric codes with a labelled factor.
# Assumes 1 = Monday ... 7 = Sunday; verify this against the dataset's code legend.
q2_clean$day_name <- factor(q2_clean$day_of_week,
                            levels = c(1, 2, 3, 4, 5, 6, 7),
                            labels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                       "Friday", "Saturday", "Sunday"))

# Extract the month name from the date
q2_clean$month_name <- format(as.Date(q2_clean$date), "%B")

# Store the month names as a factor in calendar order so that future graphs sort correctly
q2_clean$month_name <- factor(q2_clean$month_name,
                              levels = c("January", "February", "March", "April", "May", "June",
                                         "July", "August", "September", "October", "November", "December"))

# Do the same for the severity level: 1 is fatal, 2 is serious, 3 is slight
q2_clean$accident_severity_chr <- factor(q2_clean$accident_severity,
                                         levels = c(1, 2, 3),
                                         labels = c("Fatal", "Serious", "Slight"))

# Create a time_ranges column from the hour column; this is used when we analyse the time of day
q2_clean <- q2_clean %>%
  mutate(
    time_ranges = case_when(
      hour >= 0 & hour < 6 ~ "0-6 AM",    # Hour greater than or equal to 0 and smaller than 6
      hour >= 6 & hour < 12 ~ "6-12 AM",
      hour >= 12 & hour < 18 ~ "12-6 PM",
      hour >= 18 ~ "6-12 PM"
    ),
    # Fix the order of the factor levels so that statistical tests and regressions
    # report the time ranges in this order, which makes interpretation much easier
    time_ranges = factor(time_ranges, levels = c("0-6 AM", "6-12 AM", "12-6 PM", "6-12 PM"))
  )

# Create casualty_class_chr, a labelled version of the casualty code
# (i.e. was the individual a driver/rider, a passenger or a pedestrian).
# NOTE: these three labels describe a casualty class; check that casualty_reference
# is the intended source column.
q3a_clean$casualty_class_chr <- factor(q3a_clean$casualty_reference,
                                       levels = c(1, 2, 3),
                                       labels = c("Driver/Rider", "Passenger", "Pedestrian"))

# Do the same for the casualty severity level: 1 is fatal, 2 is serious, 3 is slight
q3a_clean$casualty_severity_chr <- factor(q3a_clean$casualty_severity,
                                          levels = c(3, 2, 1),
                                          labels = c("Light", "Serious", "Fatal"))

# Create sex_chr, a labelled version of the sex_of_casualty column
q3a_clean$sex_chr <- factor(q3a_clean$sex_of_casualty,
                            levels = c(1, 2),
                            labels = c("Male", "Female"))

# Bins used to create the age groups (0-18, 19-25, etc.)
age_bins <- c(0, 18, 25, 35, 50, 65, Inf)

# Define age group labels
age_labels <- c("Young (0-18)", "Young Adult (19-25)", "Adult (26-35)",
                "Middle-Aged (36-50)", "Senior (51-65)", "Old (66+)")

# Create the "age_groups" column
q3a_clean$age_groups <- cut(q3a_clean$age_of_casualty, breaks = age_bins,
                            labels = age_labels, include.lowest = TRUE)

# Convert "age_groups" to a factor with custom labels
q3a_clean$age_groups <- factor(q3a_clean$age_groups, levels = age_labels)

# Create casualty_type_chr, a labelled version of the casualty_type column
# (refer to the legend for the explanation of these codes and their meanings)
q3a_clean$casualty_type_chr <- factor(q3a_clean$casualty_type,
                                      levels = c(0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 16, 17, 18, 19,
                                                 20, 21, 22, 23, 90, 97, 98, 99),
                                      labels = c("Pedestrian", "Cyclist", "Moto [50cc]", "Moto [125cc]",
                                                 "Moto [125/500cc]", "Moto [+500cc]", "Taxi/Private Car",
                                                 "Car", "Minibus", "Bus/Coach", "Horse Rider", "Agri", "Tram",
                                                 "Van/Goods [<3.5ton]", "Van/Goods [3.5/7.5tons]",
                                                 "Van/Goods [>7.5ton]", "Mobility Scooter", "Electric Moto",
                                                 "Other Vehicle", "Moto [unk CC]", "Van/Goods [unk ton]",
                                                 "Unknown"))

# As before, create a vehicle category from the vehicle type code
# (this was done for the previous datasets and might be redundant - please check and use accordingly)
q3b_clean <- q3b_clean %>%
  mutate(vehicle_category = case_when(
    vehicle_type %in% c(1) ~ "Cyclist",
    vehicle_type %in% c(2, 3, 4, 5, 23, 97, 103, 104, 105, 106) ~ "Motorcycle",
    vehicle_type %in% c(8, 9, 108, 109) ~ "Car",
    vehicle_type %in% c(19, 20, 21, 98, 113) ~ "Trucks",
    TRUE ~ "Other"  # For vehicle types that don't fall into these categories
  ))

# Add a month column to the 3b dataset, based on its date column
q3b_clean$month <- month(as.Date(q3b_clean$date, format = "%d/%m/%Y"))

# Decode the vehicle type into a character column; please refer to the legend for the
# meanings, as the codes might change if a new dataset is used
q3b_clean <- q3b_clean %>%
  mutate(
    vehicle_type_chr = case_when(
      vehicle_type == 1 ~ "Pedal cycle",
      vehicle_type == 2 ~ "Motorcycle 50cc and under",
      vehicle_type == 3 ~ "Motorcycle 125cc and under",
      vehicle_type == 4 ~ "Motorcycle over 125cc and up to 500cc",
      vehicle_type == 5 ~ "Motorcycle over 500cc",
      vehicle_type == 8 ~ "Taxi/Private hire car",
      vehicle_type == 9 ~ "Car",
      vehicle_type == 10 ~ "Minibus (8 - 16 passenger seats)",
      vehicle_type == 11 ~ "Bus or coach (17 or more pass seats)",
      vehicle_type == 16 ~ "Ridden horse",
      vehicle_type == 17 ~ "Agricultural vehicle",
      vehicle_type == 18 ~ "Tram",
      vehicle_type == 19 ~ "Van / Goods 3.5 tonnes mgw or under",
      vehicle_type == 20 ~ "Goods over 3.5t. and under 7.5t",
      vehicle_type == 21 ~ "Goods 7.5 tonnes mgw and over",
      vehicle_type == 22 ~ "Mobility scooter",
      vehicle_type == 23 ~ "Electric motorcycle",
      vehicle_type == 90 ~ "Other vehicle",
      vehicle_type == 97 ~ "Motorcycle - unknown cc",
      vehicle_type == 98 ~ "Goods vehicle - unknown weight",
      vehicle_type == 99 ~ "Unknown vehicle type (self rep only)",
      vehicle_type == 103 ~ "Motorcycle - Scooter (1979-1998)",
      vehicle_type == 104 ~ "Motorcycle (1979-1998)",
      vehicle_type == 105 ~ "Motorcycle - Combination (1979-1998)",
      vehicle_type == 106 ~ "Motorcycle over 125cc (1999-2004)",
      vehicle_type == 108 ~ "Taxi (excluding private hire cars) (1979-2004)",
      vehicle_type == 109 ~ "Car (including private hire cars) (1979-2004)",
      vehicle_type == 110 ~ "Minibus/Motor caravan (1979-1998)",
      vehicle_type == 113 ~ "Goods over 3.5 tonnes (1979-1998)",
      vehicle_type == -1 ~ "Data missing or out of range",
      TRUE ~ NA_character_  # Default case
    )
  )

# Convert the car passenger code into a labelled factor
q3a_clean$car_passenger_chr <- factor(q3a_clean$car_passenger,
                                      levels = c(0, 1, 2, 9, -1),
                                      labels = c("Not car passenger", "Front seat passenger",
                                                 "Rear seat passenger", "unknown", "missing"))
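To illustrate the distinction drawn above between attaching labels to codes and one-hot encoding, here is a minimal sketch on made-up severity codes. The vector severity_code is hypothetical and is not taken from our datasets; only the 1/2/3 coding follows the text.

```r
# Hypothetical severity codes (1 = Fatal, 2 = Serious, 3 = Slight); not real data
severity_code <- c(3, 3, 2, 1, 3, 2)

# Labelling: still one column, but the codes are replaced by readable labels
severity_chr <- factor(severity_code,
                       levels = c(1, 2, 3),
                       labels = c("Fatal", "Serious", "Slight"))

# One-hot encoding: one 0/1 indicator column per level (used for modelling, not for reading)
one_hot <- model.matrix(~ severity_chr - 1)

table(severity_chr)  # counts per readable label
dim(one_hot)         # one row per observation, one column per level
```

The labelled factor keeps the data in a single readable column, which is what the graphs and tables in this report rely on, whereas the one-hot matrix spreads the same information over several indicator columns.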
2.3.6 Transformation of LSOA to UTLA
Upon exploring our dataset, we noticed a column containing the LSOA (Lower Layer Super Output Area) assigned to each accident. LSOAs form a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales. However, given that there were 33,755 different LSOAs in 2021, attempting to visualize and analyze so many regions would have proven exceedingly challenging. We therefore decided to aggregate these LSOAs at the UTLA (Upper Tier Local Authority) level, which comprises larger regions (217 in total), to derive more generalized insights.
Unfortunately, our dataset lacked information regarding UTLAs, requiring us to import a conversion dataset containing the corresponding UTLA code for each LSOA (ONS Geography Office for National Statistics, 2020). We joined this dataset to our primary spatial dataset, adding columns for the corresponding UTLA code and name. This step proved extremely difficult, given that the coding for both LSOAs and UTLAs changes regularly. After multiple weeks of debugging and thorough research, we discovered that the dataset we had used covered 2011 – 2017, whereas UTLA codes had changed in 2019. To address this discrepancy, we used a more recent dataset that aligned with our 2022 data. Once this issue was rectified, we were able to import a GeoJSON file containing the spatial boundaries of each of our UTLAs for visual exploration and analysis (DLUCH GIS Team Ministry of Housing, Communities and Local Government, 2019).
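The join described above, and the kind of check that exposes a lookup table from the wrong vintage, can be sketched as follows. This is a minimal illustration in base R under stated assumptions: the data frames, column names (lsoa_code, utla_code, utla_name) and codes are all hypothetical and do not reproduce our actual files.

```r
# Hypothetical accident records keyed by LSOA code (codes are made up)
accidents <- data.frame(
  accident_id = 1:5,
  lsoa_code = c("L001", "L002", "L002", "L003", "L999"),
  stringsAsFactors = FALSE
)

# Hypothetical LSOA-to-UTLA conversion table
lookup <- data.frame(
  lsoa_code = c("L001", "L002", "L003"),
  utla_code = c("U01", "U01", "U02"),
  utla_name = c("Area A", "Area A", "Area B"),
  stringsAsFactors = FALSE
)

# Left join: keep every accident, attach the UTLA code and name where a match exists
joined <- merge(accidents, lookup, by = "lsoa_code", all.x = TRUE)

# LSOA codes without a match hint at a lookup table from a different coding period
unmatched <- unique(joined$lsoa_code[is.na(joined$utla_code)])

# Aggregate accidents at the UTLA level (unmatched rows are dropped by table())
accidents_per_utla <- table(joined$utla_name)

unmatched
accidents_per_utla
```

Inspecting the unmatched codes right after the join is a cheap way to notice a mismatch between the lookup table and the data before any aggregation is done.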
3. Exploratory Data Analysis
3.1 Spatial Exploration
In this section, we explored the spatial variables that exhibited the strongest associations with accident occurrences for each vehicle type. We also provided a comprehensive explanation of the logical progression that led us to our analysis in subsequent parts of the report.
3.1.1 What are the most common road accident characteristics?
First, we decided to look at the proportional distribution of road accidents by vehicle type relative to various spatial variables: road classifications, types, speed limitations and rural or urban nature of the road.
Conducting this first exploratory analysis represented our initial step in gaining a high-level understanding of our dataset and how and where accidents were distributed across the UK in 2022. It provided us with an initial snapshot of the situation, allowing us to make assumptions about the factors that influenced accident distribution as well as refine our exploration in the subsequent steps.
Road Types
A first assumption was that single carriageways could make up a significantly higher proportion of the road network, which would offer an initial explanation for the prevalence of accidents on these roads.
A second assumption was that the elevated danger on single carriageways could arise from vehicles traveling in opposite directions without a barrier to prevent collisions. These roads also have a high speed limit of 60 mph (~97 km/h), which might increase the associated risks.
Expand to learn more about road types
A single carriageway is a type of road that consists of a single roadway, with one or more lanes for vehicles traveling in each direction and no physical barrier separating opposing traffic.
A dual carriageway is a type of road that features two separate carriageways (roadways), each with multiple lanes, for traffic traveling in opposite directions. These carriageways are typically divided by a barrier that prevents direct interaction between vehicles traveling in opposite directions.
A slip road is a short road that allows vehicles to join or leave a main road without stopping.
We also noted the higher proportion of cyclist accidents occurring in roundabouts, which might be attributed to the difficulty cyclists face in clearly indicating their directions or drivers failing to check their blind spots before exiting roundabouts.
Road Classes
This table reveals that at least 40% of accidents in our dataset occurred on A – Major Roads for all vehicle types. However, A – Major Roads account for only 12% of the total road network across the UK, which suggests a disproportionately high rate of accidents on these roads. As illustrated in the subsequent graph, B, C, and U-roads are significantly more prevalent in the UK than A-roads (Department for Transport, 2020).
Given that A-roads serve as links between regional towns and cities, it is reasonable to anticipate higher traffic density and increased usage by road users which could explain the elevated percentage of accidents that occurred on these roads.
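To make the disproportion concrete, here is a rough back-of-the-envelope sketch using only the two figures quoted above: the 40% lower bound on the accident share and the 12% share of the road network. The variable names are ours, and the comparison deliberately ignores traffic volume, so it should be read as an indication rather than a risk estimate.

```r
# Figures quoted in the text
accident_share <- 0.40  # lower bound on the share of accidents on A roads
network_share <- 0.12   # share of the UK road network made up of A roads

# Ratio above 1 means A roads see more accidents than their share of road length suggests
over_representation <- accident_share / network_share
round(over_representation, 2)
```

Under these assumptions, A roads carry at least about 3.3 times as many accidents as their share of the network alone would suggest.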
Expand to learn more about road classes
A-Roads:
Major roads between regional towns and cities.
Can be single or dual-carriageway.
Can be found in urban and rural areas.
B & C Roads:
Minor roads connecting small towns and villages.
Usually single carriageway with two lanes.
White signs with black text.
Motorways:
High-speed roads linking major towns and cities.
Typically have three lanes and two carriageways, with a safety barrier to protect from oncoming traffic.
Speed limit is typically 70mph.
No pedestrians, bicycles, or slow vehicles allowed.
Speed Limits
This table indicates that the majority of accidents, proportionally, took place on roads with a 30 mph (~48 km/h) speed limit for all vehicle types.
Conversion table from mph to km/h

| miles per hour (mph) | kilometers per hour (km/h) |
|---|---|
| 20 mph | ~32 km/h |
| 30 mph | ~48 km/h |
| 40 mph | ~64 km/h |
| 50 mph | ~80 km/h |
| 60 mph | ~97 km/h |
| 70 mph | ~113 km/h |
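The rounded values in the conversion table can be reproduced with a one-line helper. This is a small sketch; the function name mph_to_kmh is ours, and it uses the standard definition of one mile as 1.609344 km.

```r
# Convert miles per hour to kilometers per hour (1 mile = 1.609344 km)
mph_to_kmh <- function(mph) mph * 1.609344

speed_limits_mph <- c(20, 30, 40, 50, 60, 70)
speed_limits_kmh <- round(mph_to_kmh(speed_limits_mph))

data.frame(mph = speed_limits_mph, kmh = speed_limits_kmh)
```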
While this aligned with our expectations for cyclists who predominantly navigate urban areas with lower speed limits, it was somewhat unexpected that nearly 50% of accidents occurred on such low-speed roads, particularly for trucks. This observation raised the possibility that traffic density, maneuvering challenges, or decreased driver attentiveness may have contributed to this trend, though these are speculative assumptions.
Urban vs. Rural Areas
This table indicates that most accidents across all vehicle types occurred in urban environments. Cyclists experienced the highest rate of accidents in cities (83.2%), which aligns with their frequent use of city roads. Interestingly, cars (70.1%) and motorcycles (71%) showed similar percentages of urban accidents. This similarity suggests that car and motorcycle usage was predominantly urban and that most accidents occurred in dense traffic and complex road situations rather than during long-distance travel. Conversely, trucks exhibited a more balanced distribution of urban and rural accidents, likely reflecting their association with long-distance travel.
3.1.2 Per vehicle type - where do accidents occur the most?
Code
# Map every accident as a small point, separately for each vehicle type.

# Convert q1_clean (our newly created dataset) into a geospatial (sf) data frame for plotting.
# crs = 4326 sets the coordinate reference system to WGS 84, the standard for
# longitude/latitude coordinates. Feel free to change this if desired.
accident_data_sf <- st_as_sf(q1_clean, coords = c("longitude", "latitude"), crs = 4326)

# Base map of the UK on which to draw the points
# (a larger scale could be used to increase the level of detail)
uk <- ne_countries(scale = "medium", country = "united kingdom", returnclass = "sf")

# Geographic boundaries for zooming in on the UK - this is the "default" view of the map
xlims <- c(-7, 2)     # Longitude limits
ylims <- c(49.5, 56)  # Latitude limits

# Dark theme shared by all four maps
theme_black_bg <- function() {
  theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#212529", color = NA),
      panel.background = element_rect(fill = "#212529", color = NA),
      plot.title = element_text(color = "white"),
      axis.title = element_text(color = "white"),
      axis.text = element_text(color = "white"),
      legend.title = element_text(color = "white"),
      legend.text = element_text(color = "white"),
      legend.position = "none"
    )
}

# Car accidents
car_data <- accident_data_sf %>% filter(Car > 0)
car_plot <- ggplot() +
  geom_sf(data = uk, fill = "#212529", color = "#BBBBBB") +
  geom_sf(data = car_data, color = "white", size = 0.00000000001) +  # Very small white points
  coord_sf(xlim = xlims, ylim = ylims, expand = FALSE) +
  labs(title = "Car Accidents") +
  theme_black_bg()

# Motorcycle accidents
motorcycle_data <- accident_data_sf %>% filter(Motorcycle > 0)
motorcycle_plot <- ggplot() +
  geom_sf(data = uk, fill = "#212529", color = "#BBBBBB") +
  geom_sf(data = motorcycle_data, color = "white", size = 0.00000000001) +  # Very small white points
  coord_sf(xlim = xlims, ylim = ylims, expand = FALSE) +
  labs(title = "Motorcycle Accidents") +
  theme_black_bg()

# Truck accidents
truck_data <- accident_data_sf %>% filter(Trucks > 0)
truck_plot <- ggplot() +
  geom_sf(data = uk, fill = "#212529", color = "#BBBBBB") +
  geom_sf(data = truck_data, color = "white", size = 0.00000000001) +  # Very small white points
  coord_sf(xlim = xlims, ylim = ylims, expand = FALSE) +
  labs(title = "Truck Accidents") +
  theme_black_bg()

# Bicycle accidents
cycle_data <- accident_data_sf %>% filter(Cyclist > 0)
cycle_plot <- ggplot() +
  geom_sf(data = uk, fill = "#212529", color = "#BBBBBB") +
  geom_sf(data = cycle_data, color = "white", size = 0.00000000001) +  # Very small white points
  coord_sf(xlim = xlims, ylim = ylims, expand = FALSE) +
  labs(title = "Cycling Accidents") +
  theme_black_bg()

# Print the plots
car_plot
motorcycle_plot
truck_plot
cycle_plot
Having identified that road accidents exhibited varying spatial patterns depending on the vehicle type involved (mostly on A – Major Roads, on single carriageways, at speed limits of 30 mph, and in urban conditions), we decided to create maps to explore the underlying patterns of road accidents more effectively and to see whether some areas were more affected than others.
As anticipated, and confirming the insights we gained from the previous tables, these maps revealed that accidents did not cluster uniformly across the UK for all vehicle types. For example, car accidents appeared to be distributed throughout the entire UK (the number of accidents recorded in 2022 is large enough that the maps trace out the UK's intricate road network), whereas bicycle accidents were predominantly concentrated in urban areas such as London.
Furthermore, our maps highlighted an interesting trend: the London region (bottom right corner) consistently exhibited a notably high density of accidents across all vehicle types. This phenomenon might be attributed to the region’s exceptionally dense population leading to a high volume of traffic and greater interactions between vehicles.
3.1.3 Uncovering UK’s population
Recognizing that certain areas (and cities), like London, had a higher propensity for road accidents, we then created a map of population by UTLA to discern whether the population was distributed uniformly across regions. The analysis, scaled at the UTLA level rather than the LSOA level, revealed distinct variations in population dispersal.
For instance, it was evident that UTLAs in the southeast of the UK had considerably larger populations than those in the central and northern regions. These variations had to be considered to ensure a fair analysis of accident distribution and comparisons between different areas. Additionally, the UTLAs in the London area showed relatively low populations, mainly because London is divided into smaller UTLAs, which further emphasized the importance of putting the data on a comparable scale.
Code
# Join population data with UTLA boundaries
utla_boundaries_with_population <- left_join(utla_boundaries, population_by_utla,
                                             by = c("ctyua19cd" = "UTLA20CD"))

# Create a more nuanced color palette using colorQuantile
color_palette <- colorQuantile("viridis", utla_boundaries_with_population$total_population, n = 5)

# Create a ggplot map
population_map <- ggplot() +
  geom_sf(data = utla_boundaries_with_population, aes(fill = total_population)) +
  scale_fill_gradientn(
    colours = viridis::viridis(5),
    breaks = pretty_breaks(n = 5)(utla_boundaries_with_population$total_population),
    labels = scales::comma
  ) +
  labs(fill = "Total Population", title = "Map of Population by UTLA") +
  theme_void() +                    # A theme with no axes for a clean map
  theme(legend.position = "right")  # Position the legend on the right

population_map
In our subsequent analysis, we therefore normalized the accident data by the population of each area. This normalization allowed us to gain deeper insights into the different UK regions and to answer our RQ1, which asked whether specific regions experienced a disproportionately higher or lower number of accidents.
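As an illustration of this normalization, here is a minimal sketch that expresses accident counts per 100,000 inhabitants. The three areas and all figures are made up, and the per-100,000 scaling is an assumption chosen for readability; the exact scaling used in our analysis may differ.

```r
# Hypothetical accident counts and populations for three areas (values are made up)
utla <- data.frame(
  utla_name = c("Area A", "Area B", "Area C"),
  accidents = c(1200, 300, 450),
  total_population = c(1500000, 200000, 600000)
)

# Accidents per 100,000 inhabitants puts areas of different sizes on a comparable scale
utla$accidents_per_100k <- utla$accidents / utla$total_population * 1e5

# Rank areas by the normalized rate rather than by the raw count
utla[order(-utla$accidents_per_100k), ]
```

In this toy example, Area A has the most accidents in absolute terms, but Area B has the highest rate once population is taken into account, which is exactly the kind of reversal the normalization is meant to reveal.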
3.2 Temporal Exploration
In this section, we explored the relationship between temporality and accident occurrences. This provided us with insights that guided our subsequent analysis of different timeframes, including hours (time ranges), days of the week, and months.
3.2.1 Daily Exploration
Code
# Count accidents per day and add a centered 30-day rolling average
daily_counts <- q2_clean %>%
  group_by(date) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(rolling_avg = zoo::rollmean(count, k = 30, fill = NA))

# Find the day with the maximum number of accidents
max_accidents_day <- daily_counts %>%
  filter(count == max(count)) %>%
  slice(1) %>%
  pull(date)

# Find days with notably low accident counts (you can adjust the threshold as needed)
low_accidents_days <- daily_counts %>%
  filter(count < quantile(count, probs = 0.05)) %>%
  summarise(date = paste(date, collapse = ", "))

# Calculate the overall average of accidents per day
overall_avg <- mean(daily_counts$count, na.rm = TRUE)

# Create the ggplot object
# (linewidth replaces the size aesthetic for lines, which was deprecated in ggplot2 3.4.0)
p <- ggplot(daily_counts, aes(x = date, y = count)) +
  geom_line(aes(color = "Daily Counts"), linewidth = 0.125) +
  geom_line(aes(y = rolling_avg, color = "30-Day Rolling Average"), linewidth = 0.8) +
  geom_hline(yintercept = overall_avg, linetype = "dashed", color = "#2CA02C", linewidth = 0.8) +
  labs(
    title = "Daily Counts, 30-Day Rolling Average, and Overall Average of Road Accidents",
    x = "Date", y = "Number of Accidents"
  ) +
  scale_color_manual(values = c("Daily Counts" = "#4C4E4D",
                                "30-Day Rolling Average" = "#FE4A49",
                                "Overall Average" = "#2CA02C")) +
  theme_minimal() +
  theme(legend.position = "right",
        legend.title = element_blank(),
        legend.text = element_text(size = 8),
        legend.key.size = unit(0.5, "lines"))

# Convert to an interactive plot with specified sizing
p_interactive <- ggplotly(p, tooltip = c("x", "y", "Legend"), dynamicTicks = TRUE,
                          width = 800, height = 400)
p_interactive
We initiated our temporal exploration by examining the daily distribution of accidents throughout the year. Our objective was to explore how the number of accidents was distributed throughout the year, how it varied, and how it compared to the average.
We generated the following graph, which clearly illustrates distinct peaks and troughs in the data, indicating noticeable day-to-day fluctuations. Notably, specific dates, such as 2022-11-04 and 2022-12-25, stood out as extremes. On 2022-11-04, we observed an extraordinary peak with over 434 accidents occurring in a single day. At the other end, only two days in the entire year recorded fewer than 140 accidents: December 25th, which had 132 accidents, and September 19th, which had 139 accidents. The decline in accident numbers on December 25th might reasonably be attributed to a reduction in traffic volume resulting from Christmas celebrations. Similarly, September 19th was an extra bank holiday declared across the UK following the passing of Queen Elizabeth II, which likely resulted in fewer vehicles in circulation on that day.
On average, there were 277 accidents per day throughout the entire year (indicated by the green line). By employing a 30-day rolling average, which smooths out daily fluctuations and reveals underlying trends, we were able to discern monthly patterns in our data: the line on the chart moves above and below the average line depending on the month. This is a first indication that the month might have an impact on the number of accidents that occur, something that we investigate later in our paper.
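To show what the rolling average does, here is a minimal sketch on a toy series. It uses a 3-day window instead of 30 so the arithmetic can be followed by hand, and base R's stats::filter() as a stand-in for zoo::rollmean(); the counts are made up.

```r
# Hypothetical daily accident counts (values are made up)
daily_counts <- c(10, 20, 30, 40, 50, 60, 70)

# Centered moving average over k days: each value is the mean of the day itself
# and its neighbours; the ends are NA because the window is incomplete there
k <- 3
rolling_avg <- as.numeric(stats::filter(daily_counts, rep(1 / k, k), sides = 2))

# Overall average, the equivalent of the dashed reference line
overall_avg <- mean(daily_counts)

rolling_avg
overall_avg
```

With a 30-day window the same mechanism averages out weekday effects and single-day extremes, which is why the smoothed line mainly reflects month-level movement around the overall average.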
Recognizing the occurrence of daily and monthly variations throughout the year, we then embarked on an exploration of potential patterns within different time frames, including hours of the day, days of the week, and months of the year.
3.2.2 Hourly and Weekly Exploration
Code
# Count accidents for every combination of hour and day of the week
agg_data <- q2_clean %>%
  group_by(hour, day_name) %>%
  summarise(accidents_count = n()) %>%
  ungroup()

# Heatmap of the counts, with dashed boxes around the morning and afternoon peaks
ggplot(agg_data, aes(x = hour, y = day_name, fill = accidents_count)) +
  geom_tile() +
  scale_fill_gradient(low = "#F1F3F2", high = "#FE4A49") +
  labs(
    x = "Hour",
    y = "Day of the Week",
    fill = "Number of Accidents",
    title = "Heatmap of Accidents by Hour and Day"
  ) +
  theme_minimal() +
  annotate("rect", xmin = 5.5, xmax = 9, ymin = "Monday", ymax = "Sunday",
           fill = NA, color = "blue", size = 1, linetype = "dashed") +
  annotate("rect", xmin = 14, xmax = 19, ymin = "Monday", ymax = "Sunday",
           fill = NA, color = "blue", size = 1, linetype = "dashed")
We began by examining the most common accident peaks, considering both the hour of the day and the day of the week through the utilization of a heatmap. This visualization depicts higher frequencies of accidents in darker shades of red and lower frequencies in lighter shades, approaching white.
In our exploration, we identified two prominent “peaks” on our chart: one occurring between 7 a.m. and 8 a.m. and another between 3 p.m. and 6 p.m. Interestingly, this consistent pattern extended from Monday to Saturday, aligning with the typical workweek.
This temporal pattern underscores the fact that accidents tended to occur predominantly during rush hours, possibly due to the higher volume of road users during these times, which might have increased the likelihood of accidents.
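The counts behind such a heatmap are a simple cross-tabulation of hour against day of the week. The sketch below illustrates this on six made-up records using base R's table(); the data frame obs is hypothetical. Unlike the group_by() and summarise() approach, table() also keeps the empty hour and day combinations, with a count of zero.

```r
# Hypothetical accident records (values are made up)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
obs <- data.frame(
  hour = c(8, 8, 17, 17, 17, 3),
  day_name = factor(c("Monday", "Tuesday", "Monday", "Monday", "Friday", "Sunday"),
                    levels = days)
)

# One row per hour/day combination with the number of accidents in that cell
agg <- as.data.frame(table(hour = obs$hour, day_name = obs$day_name),
                     responseName = "accidents_count")

# The cell with the highest count corresponds to the darkest tile of the heatmap
busiest <- agg[which.max(agg$accidents_count), ]
busiest
```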
3.2.3 Weekly Exploration
Next, we looked at the daily accident count categorized by day of the week. This closer examination revealed a notable pattern: as the week unfolded, there was a gradual increase in the average number of road accidents per day, culminating in a peak on Saturdays, which could be reflective of heightened traffic due to leisure activities or perhaps the cumulative fatigue accumulated over the workweek.
Interestingly, Mondays not only had the fewest incidents on average but also exhibited a maximum number of accidents (excluding outliers) that was lower than even the median accident count observed on Fridays and Saturdays.
The distribution of accidents indicated a relatively low degree of fluctuation in daily accident counts. Mondays and Saturdays, for instance, exhibited a tight interquartile range, suggesting a consistent number of accidents. Tuesdays had a slightly broader box, indicating a less predictable pattern, which could warrant further investigation into external factors influencing these fluctuations.
The proximity of the median to the daily mean—depicted by the red dot—on most days suggested a symmetrical distribution of data.
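The quantities read off the boxplot (median, mean and interquartile range per day) can be computed directly. Here is a minimal sketch on made-up counts for two weekdays; the data frame counts is hypothetical and chosen so that one day is symmetric and the other is pulled upwards by a single extreme value.

```r
# Hypothetical daily accident counts for two weekdays (values are made up)
counts <- data.frame(
  day_name = rep(c("Monday", "Friday"), each = 5),
  count = c(200, 210, 220, 230, 240,   # Monday: symmetric around 220
            250, 280, 300, 320, 450)   # Friday: one unusually high day
)

# Summary statistics per weekday
med <- tapply(counts$count, counts$day_name, median)
avg <- tapply(counts$count, counts$day_name, mean)
iqr <- tapply(counts$count, counts$day_name, IQR)

data.frame(median = med, mean = avg, iqr = iqr)
```

In the symmetric case the mean and the median coincide, whereas the single high value on the second day pulls the mean above the median, which is the pattern to look for when comparing the red dot with the median line.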
Outliers on the plot could be the result of specific and unusual circumstances such as public events or extreme weather conditions.
In light of these findings, we investigated further the association between the day of the week and accident occurrence in our analysis as well as any potential links between the day of the week and accident severity.
3.2.4 Monthly Exploration
Code
# Reshape to long format and sum the casualties per date and severity type
long_format_month <- q2_clean %>%
  gather(key = "severity_type", value = "count", num_fatal, num_serious, num_slight) %>%
  group_by(date, month_name, severity_type) %>%
  summarise(daily_count = sum(count)) %>%
  ungroup()

# Normalizing the data (calculating the z-score within each severity type)
long_format_month <- long_format_month %>%
  group_by(severity_type) %>%
  mutate(
    mean_count = mean(daily_count),
    sd_count = sd(daily_count),
    normalized_count = (daily_count - mean_count) / sd_count
  ) %>%
  ungroup()

# Create the box plot with normalized counts
# (axis label and title corrected to refer to months, which is what the x axis shows)
ggplot(long_format_month, aes(x = month_name, y = normalized_count, fill = severity_type)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Adds a horizontal line at y = 0
  labs(
    x = "Month",
    y = "Normalized Number of Casualties",
    title = "Normalized Boxplot of Casualties by Month and Severity Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_blank()
  )
We then turned our focus to the daily accident counts across the months. Upon examining the monthly distribution of the daily average of road accidents, we observed that each month maintained a relatively stable median of incidents throughout the year. The boxplot revealed that the month of December exhibited a wider interquartile range, indicating significant day-to-day variability in the number of accidents. September, on the other hand, presented a contrasting picture with its notably narrower spread.
The patterns in the boxplot above point to an interesting finding: while month-to-month changes had a relatively minimal impact on the overall daily accident average, specific months displayed more variation. These variations could be linked to external factors associated with particular months, such as holidays, school vacations or weather patterns.
More interestingly still, we can hypothesize that months hold specific temporal patterns, whereas weeks, which repeat 52 times in a year, might not hold the same underlying temporal trends. This is therefore something we looked into further in our analysis.
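The boxplot code above standardizes the daily counts within each severity type (a z-score), which is what allows rare fatal casualties and frequent slight casualties to share one axis. Here is a minimal base R sketch of that step on made-up numbers; the data frame daily and the helper z are hypothetical.

```r
# Hypothetical daily casualty counts for two severity types on very different scales
daily <- data.frame(
  severity_type = rep(c("num_serious", "num_slight"), each = 4),
  daily_count = c(10, 20, 30, 40, 100, 200, 300, 400)
)

# z-score: distance from the group mean, measured in group standard deviations
z <- function(x) (x - mean(x)) / sd(x)

# ave() applies z() separately within each severity type
daily$normalized_count <- ave(daily$daily_count, daily$severity_type, FUN = z)

daily
```

After standardization both groups have mean 0 and standard deviation 1, so a value of 0 on the boxplot corresponds to a typical day for that severity type regardless of its raw scale.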
3.3 Demographic Exploration
In this section, we explored the relationship between demographics and accident occurrences. Unlike our previous sections, our exclusive focus here was on the drivers of the vehicles, a deliberate choice aimed at mitigating biases.
From a statistical standpoint, it made more sense to examine the characteristics of the individuals responsible for the accidents rather than passengers or other individuals whose demographics wouldn’t significantly impact the accident outcomes.
3.3.1 Exploring Age Distribution
Code
max_count <- max(table(filtered_q3a_clean$age_of_casualty))

# Histogram of driver ages; width and height are set in plot_ly()
# because specifying them in layout() is deprecated
plotly_gg <- plot_ly(
  data = filtered_q3a_clean,
  x = ~age_of_casualty,
  type = "histogram",
  marker = list(color = "#4c4e4d"),
  width = 800,
  height = 400
) %>%
  layout(
    title = "Age Distribution of Driver",
    xaxis = list(title = "Age"),
    yaxis = list(title = "Number of Accidents per Given Age"),
    hovermode = "x"
  ) %>%
  config(responsive = TRUE)  # Responsive to container size

plotly_gg
Code
# Share of accidents per age group, expressed as a percentage
age_group_counts <- table(filtered_q3a_clean$age_groups)
age_group_proportions <- as.numeric(prop.table(age_group_counts))
age_group_df <- data.frame(Age_Group = names(age_group_counts), Proportion = age_group_proportions)
age_group_df$Percentage <- round(age_group_df$Proportion * 100, 2)
age_group_df <- age_group_df[, c("Age_Group", "Percentage")]
age_group_kable <- knitr::kable(age_group_df, caption = "Proportion of Accidents by Age Groups")
age_group_kable
Proportion of Accidents by Age Groups

| Age_Group | Percentage |
|---|---|
| Young (0-18) | 6.82 |
| Young Adult (19-25) | 17.25 |
| Adult (26-35) | 24.39 |
| Middle-Aged (36-50) | 25.86 |
| Senior (51-65) | 17.73 |
| Old (66+) | 7.95 |
The histogram, which is right-skewed, revealed that approximately 29.72% of accidents involved drivers aged between 18 and 29, with the highest frequency occurring at the age of 29. From age 29 onward, the distribution drops sharply until the age of 44, where it rises before declining steadily again and reaching a plateau between 65 and 75.
This data aligns with the typical age range during which individuals first acquire their driving licenses and may suggest that young, relatively inexperienced drivers might have a heightened likelihood of being involved in accidents, or that these individuals might drive more than their older counterparts. The factors contributing to this increased risk might include but are not limited to: limited driving experience, a higher inclination towards risk-taking behaviors, a greater likelihood of being distracted, a tendency to drive at night, and a preference for certain types of vehicles that may be more challenging to control such as motorcycles vs cars.
3.3.2 Exploring Gender
Having examined age, we shifted our focus to gender. The data distinctly indicates a higher number of accidents among men, with over 40,000 slight accidents involving male drivers compared to fewer than 20,000 involving female drivers. One plausible hypothesis is a significant gender imbalance in the driving population, with a higher proportion of male drivers than female drivers at all severity levels. Data released by the UK government for 2020 showed that men drove on average 22% more miles than women per year (Department for Transport, 2021). Another possible explanation for both the larger distance traveled by men and their higher number of accidents is that men might use their vehicles not only as a means of transport to and from work, but also as a tool of their trade (e.g. trucks for truck drivers, or taxis for taxi drivers). To quickly explore this hypothesis, we looked at the proportion of male taxi drivers in our dataset, which showed that 95.904% of them were male, in line with this hypothesis.
3.5 Vehicle Exploration
3.5.1 Understanding Road Accident Frequencies
In this section we set out to explore vehicles and their characteristics to be able to fully understand how our data is laid out.
After examining various demographics, our focus shifted to vehicle types to better understand their distribution and the proportion of accidents involving each category.
The mosaic plot offers a clear depiction of this distribution, showing that cars were involved in a predominant share of accidents. Specifically, cars accounted for 68.43% of incidents, which corresponds to 86739 cases. This substantial figure significantly surpassed that of other vehicle types, highlighting cars as a major area of concern in traffic-related incidents.
In contrast, trucks accounted for 7.5% of accidents. Cyclists made up a higher percentage, being involved in 12.89% of the accidents, which equates to 16335 recorded instances.
3.5.2 Vehicle Accidents Over 2022
Then, we examined the number of accidents per vehicle type during each month of the year (since the frequency of accidents differed between vehicle types, each chart has its own axis scale).
The objective of this visualization was to facilitate the exploration and comparison of accident trends between various vehicle types throughout the year.
A common trend we identified across all vehicle categories was a decrease in the number of accidents during the winter months, particularly in December, January, and February. This decrease was especially pronounced for motorcycles, bicycles, and trucks. Both cars and trucks experienced an increase in the number of accidents from September through November, followed by a sharp decline in December.
Upon examining the peaks in each category, we found that November had the highest number of accidents for cars, with a total of 7907 accidents. In contrast, July emerged as the peak month for bicycles, and June took the lead for motorcycles, with 1234 accidents. Like cars, trucks had their accident peak in November.
Another interesting discovery is that motorcycles and bicycles both displayed a peak in accidents during the warmer months, hinting that individuals might use motorcycles and bicycles as a primary means of transport in warmer months, then switch to another means of transport, such as public transport or cars, when the weather worsens.
This similarity in car and truck accident trends could be attributed to their susceptibility to adverse winter weather conditions, while the increased accidents for motorcycles and bicycles during warmer months might be linked to greater usage and improved road conditions.
4. Analysis
The analysis section of our report consistently follows a structured approach, designed to enhance reader understanding and facilitate the development of our analytical and data science skills. Our methodology involves initially addressing each question with a simpler method, then reinforcing our conclusions through more intricate analytical techniques. We aim for our analysis to be accessible and comprehensible to individuals with varying levels of expertise in statistics and data science.
4.1 Research Question #1: Why are certain locations more prone to accidents? Why are they particularly more dangerous?
In this section, we sought not only to pinpoint the areas with the highest incidence of accidents but also to identify locations that were prone to more serious accidents. In contrast to our earlier exploratory analysis, where we examined spatial patterns by vehicle types, our approach here aimed to comprehensively assess road safety by considering all types of road users.
Our methodological approach involved the following steps:
1. Creation and computation of accident and severity indices: We computed both an accident index and a severity index for each region, mapping them to visualize the density of accidents and severe accidents across the regions. This allowed us to identify regions that were prone to higher rates of accidents as well as locations that were more prone to severe accidents (most dangerous locations to drive).
2. Comparative analysis of spatial characteristics: Then, we compared these two sets of regions with each other and against the UK’s baseline average, examining the proportion of accidents in each region in relation to various spatial characteristics. This analysis was aimed at uncovering hidden spatial trends in road accidents and their potential correlation with accident frequency and severity.
3. Logit regression: Lastly, we performed a logistic regression analysis to dive deeper into the likelihood of being involved in a more severe accident as opposed to a slight one, considering the presence of specific spatial variables. This stage of our study helped to contextualize and deepen our understanding of the results obtained from our initial two analytical steps.
4.1.1 Spatial Mapping w/ Severity Indices:
To answer our research question and identify the regions most susceptible to accidents, including those of a severe nature, we developed two distinct indices. The first, the “Accident Index,” was formulated by standardizing the number of accidents per Upper Tier Local Authority (UTLA): the number of accidents in each UTLA is divided by its population and expressed per 10,000 inhabitants (formula provided below). This Accident Index was then mapped for each UTLA, illustrating the population-adjusted frequency of accidents within each area.
The second index, referred to as the “Severity Index,” employed the same calculation as the first, with one crucial distinction: it excluded all incidents categorized as slight accidents. As with the Accident Index, we mapped the Severity Index for each UTLA.
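As a rough illustration of the two indices described above, the calculation can be written as Accident Index = (number of accidents / population) × 10,000, with the Severity Index using the same expression after slight accidents are removed. The sketch below uses invented figures and hypothetical column names (`slight_count` in particular is an assumption). The scaling constant is also an assumption: the text states 10,000 inhabitants, while the map legends in the code read "per 100,000 people".

```r
# Hypothetical per-UTLA data (names and figures invented for illustration)
utla <- data.frame(
  utla_name        = c("Area A", "Area B", "Area C"),
  accident_count   = c(500, 314, 1200),      # all recorded accidents
  slight_count     = c(420, 230, 1000),      # accidents classed as "Slight"
  total_population = c(150000, 140000, 600000)
)

per_inhabitants <- 10000  # assumed scaling constant: rate per 10,000 inhabitants

# Accident Index: all accidents per 10,000 inhabitants
utla$accident_index <- utla$accident_count / utla$total_population * per_inhabitants

# Severity Index: same calculation after excluding slight accidents
utla$severity_index <- (utla$accident_count - utla$slight_count) /
  utla$total_population * per_inhabitants

# Rank areas on each index (highest first)
utla[order(-utla$accident_index), c("utla_name", "accident_index")]
utla[order(-utla$severity_index), c("utla_name", "severity_index")]
```

In this toy data, Area A has the highest Accident Index while Area B has the highest Severity Index, which is the kind of contrast the two maps are meant to expose.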
Below you can find two interactive maps, one for the Accident Index and the other for the Severity Index. You can zoom in and out on the regions and click on them to obtain information on their populations and number of accidents. The top five UTLAs per index are listed in the table below.
Code
library(leaflet)
library(viridis)

# The City of London's index is far higher than every other UTLA's, which would
# flatten the colour scale for the rest of the choropleth map. We therefore build
# the colour bins from the minimum up to the second-highest value and colour the
# top UTLA separately.
max_value <- max(utla_boundaries_with_data$accident_index, na.rm = TRUE)  # highest index
second_max_value <- max(
  utla_boundaries_with_data$accident_index[utla_boundaries_with_data$accident_index < max_value],
  na.rm = TRUE
)  # second-highest index

# Colours come from the viridis package; swap in another palette if you prefer.
# To include the top UTLA in the bins, use max_value instead of second_max_value.
color_palette <- colorBin(
  palette = "viridis",
  domain = c(min(utla_boundaries_with_data$accident_index, na.rm = TRUE), second_max_value),
  bins = 10,
  pretty = FALSE
)

# Show the disproportionately high top UTLA in red
adjusted_color_palette <- function(value) {
  ifelse(value == max_value, "red", color_palette(value))
}

# Build the leaflet map, colouring each UTLA by its index
leaflet_map <- leaflet(utla_boundaries_with_data) %>%
  # A low-detail basemap keeps the file small. All available backgrounds:
  # https://leaflet-extras.github.io/leaflet-providers/preview/index.html
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    fillColor = ~adjusted_color_palette(accident_index),
    fillOpacity = 0.7,   # slightly see-through so the basemap shows
    color = "#444444",   # outline colour of each UTLA
    weight = 1,          # outline thickness
    # Pop-ups show more information when an area is clicked
    popup = ~paste(
      ctyua19nm,
      "<br>Accidents: ", accident_count,
      "<br>Population: ", total_population,
      "<br>Accident Index: ", round(accident_index, 2)
    )
  ) %>%
  # Legend matching the colour scheme of the map
  addLegend(
    "bottomright",
    title = "Accident Index (per 100,000 people)",
    pal = color_palette,
    values = ~accident_index,  # range of index values displayed
    # Bin boundaries formatted to two decimals, plus a "Max" entry
    labels = c(
      sprintf(
        "%.2f",
        quantile(
          c(min(utla_boundaries_with_data$accident_index, na.rm = TRUE), second_max_value),
          probs = seq(0, 1, length.out = 11)
        )
      ),
      "Max"
    )
  )

leaflet_map
# Expected output: an interactive leaflet map of UTLA boundaries, coloured by accident index
Code
library(dplyr)
library(leaflet)
library(viridis)

# This closely follows the construction of the first map.
max_value <- max(utla_boundaries_with_severity_data$adjusted_accident_index, na.rm = TRUE)
data_for_quantiles <- filter(utla_boundaries_with_severity_data, adjusted_accident_index < max_value)

# Quantile bins: the 10 colour steps used in the map
num_quantiles <- 10
quantile_bins <- quantile(
  data_for_quantiles$adjusted_accident_index,
  probs = seq(0, 1, length.out = num_quantiles + 1),
  na.rm = TRUE
)

# Highlight the maximum in red; everything else uses the viridis palette
custom_color_palette <- function(x) {
  ifelse(
    x == max_value,
    "red",
    colorBin(
      palette = "viridis",
      bins = quantile_bins,
      domain = data_for_quantiles$adjusted_accident_index
    )(x)
  )
}

# Legend colours: one per quantile bin, plus red for the top UTLA (as on the first map)
legend_colors <- c(viridis(length(quantile_bins) - 1), "red")
# Legend labels: the lower bound of each bin, plus "Max"
legend_labels <- c(sprintf("%.2f", quantile_bins[-length(quantile_bins)]), "Max")

# Initial view, centred roughly on the middle of the UK (optional; not set on the first map)
uk_center_lat <- 54.7
uk_center_lon <- -3.4
initial_zoom_level <- 6

leaflet_map <- leaflet(utla_boundaries_with_severity_data) %>%
  setView(lng = uk_center_lon, lat = uk_center_lat, zoom = initial_zoom_level) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  # UTLA polygons, coloured with the custom palette defined above
  addPolygons(
    fillColor = ~custom_color_palette(adjusted_accident_index),
    fillOpacity = 0.7,
    color = "#444444",
    weight = 1,
    # Pop-ups for extra information
    popup = ~paste(
      ctyua19nm,
      "<br>Accidents: ", accident_count,
      "<br>Population: ", total_population,
      "<br>Severity Index: ", round(adjusted_accident_index, 2)
    )
  ) %>%
  addLegend(
    "bottomright",
    title = "Severity Accident Index (per 100,000 people)",
    labels = legend_labels,
    colors = legend_colors
  )

leaflet_map
Interactive Accident Index Choropleth Map
Interactive Severity Index Choropleth Map
Important
The two maps above are created using Leaflet, an interactive mapping package for R. However, because we lacked the processing capacity to knit these two maps into a self-contained HTML file, we were unable to include them in their interactive form. We invite you to run the analysis file, or render our project with “self-contained” turned off, to get the full experience.
This dual approach revealed significant disparities among the UTLAs, emphasizing that while certain areas experienced a higher frequency of accidents, the evaluation of severity produced contrasting outcomes. For instance, the City of London and Westminster ranked as the top two areas in both indices, yet on the Severity Index, Blackpool emerged in the top three.
In terms of accident occurrence, both the City of London and Westminster consistently ranked at the top of the accident indexes.
The City of London, often referred to as the financial heart of the United Kingdom, is characterized by a high concentration of financial institutions and offices. This results in a significant influx of commuting professionals daily, contributing to increased footfall and vehicle traffic (it is estimated that the population increases up to 60-fold during the daytime) (Jack Brown, Sara Gariban, Erica Belcher, Mario Washington-Ihieme, 2020).
Similarly, Westminster is a hub for tourists, drawing millions each year to its historic sites such as Big Ben. This tourist traffic, combined with the area’s everyday operational demands, leads to busy streets and a higher likelihood of traffic incidents.
In comparison to other UK regions, both the City of London and Westminster stand out for their low residential populations, contrasted by high numbers of visitors. These areas are characterized by a mix of heavy pedestrian and vehicle traffic, coupled with a notable presence of tourists and professionals. This unique combination of factors could potentially increase the likelihood and severity of accidents in these locations, potentially skewing their accident indexes.
To maintain the integrity of our analysis and to mitigate any biases, we therefore proceeded by examining UTLAs that did not exhibit the same characteristics as the City of London and Westminster.
This led us to the discovery of two accident hotspots: Kensington and Chelsea, known for its high frequency of accidents, and Blackpool, distinguished by the severity of its accidents. Our findings were in line with other researchers, as Blackpool has recently been recognized as the most dangerous location to drive outside of London (Antony Thrower, 2023).
Therefore, using both our accident and severity indexes we were able to answer our research question, concluding that:
• The City of London and Westminster were the two regions in the UK with the highest frequency of both normal and severe accidents.
• Kensington and Chelsea was a region highly prone to frequent accidents, though these tended to be less severe.
• Blackpool, on the other hand, was a region particularly prone to severe accidents.
With these newly identified accident hotspots, we proceeded to examine how Kensington and Blackpool vary from the typical UK patterns in accident characteristics, using a comparative analysis.
4.1.2 Spatial Characteristic Proportion Table:
To achieve this, we selected spatial variables within our dataset and calculated the average proportion of accidents associated with each variable in Kensington, Blackpool, and our UK Average Baseline.
The resulting table is valuable for readers, as it facilitates a comparison of the spatial characteristics most associated with locations experiencing more and less severe accidents, relative to the UK average. It’s important to note, however, that this table is primarily intended to provide insights or generate hypotheses about potential factors leading to severe accidents. It does not offer conclusive evidence, as numerous other factors could have influenced these outcomes. The results of this table are discussed further along in section 4.1.4.
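The calculation behind such a table can be sketched in a few lines of base R: the share of accidents in each category is computed within each area and then across all records for the baseline. The data, area names, and the choice of `speed_limit` as the example characteristic are invented for illustration and are not taken from our dataset.

```r
# Hypothetical accident records (invented for illustration)
accidents <- data.frame(
  utla        = rep(c("Area A", "Area B", "Area C"), times = c(4, 2, 4)),
  speed_limit = c(20, 20, 20, 30, 30, 30, 30, 60, 60, 20)
)

# Share of accidents per speed limit within each area (each row sums to 1)
by_area <- unclass(
  prop.table(table(accidents$utla, accidents$speed_limit), margin = 1)
)

# Baseline: share of accidents per speed limit across all records
baseline <- prop.table(table(accidents$speed_limit))

# Two areas of interest next to the baseline, shown as percentages
comparison <- rbind(
  by_area[c("Area A", "Area B"), ],
  Baseline = as.vector(baseline)
)
round(100 * comparison, 2)
```

Categories that never occur in an area appear here as 0, whereas the table in this section reports them as NA.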
Comparison of Accident Characteristics between Two UTLAs and UK Baseline

| Characteristic | Category | Kensington | Blackpool | UK Baseline |
|---|---|---|---|---|
| first_road_class | Motorway | NA | NA | 3.01% |
| first_road_class | A(M) | NA | NA | 0.28% |
| first_road_class | A | 61.20% | 32.17% | 44.34% |
| first_road_class | B | 11.60% | 13.06% | 12.43% |
| first_road_class | C | 10.60% | NA | 4.34% |
| first_road_class | Unclassified | 16.60% | 54.78% | 35.60% |
| light_conditions | Daylight | 69.80% | 73.25% | 71.76% |
| light_conditions | Darkness - lights lit | 28.80% | 24.84% | 20.89% |
| light_conditions | Darkness - lights unlit | 0.20% | 0.64% | 0.74% |
| light_conditions | Darkness - no lighting | 0.60% | 0.32% | 5.33% |
| light_conditions | Darkness - lighting unknown | 0.60% | 0.96% | 1.29% |
| road_surface_conditions | Dry | 81.40% | 73.89% | 75.72% |
| road_surface_conditions | Wet or damp | 18.00% | 24.20% | 22.32% |
| road_surface_conditions | Snow | NA | NA | 0.19% |
| road_surface_conditions | Frost or ice | 0.60% | 1.59% | 1.65% |
| road_surface_conditions | Flood over 3cm. deep | NA | 0.32% | 0.12% |
| road_type | Roundabout | 3.80% | 3.50% | 6.09% |
| road_type | One way street | 11.20% | NA | 2.27% |
| road_type | Dual carriageway | 19.00% | 8.28% | 15.75% |
| road_type | Single carriageway | 64.20% | 86.94% | 74.11% |
| road_type | Slip road | 1.80% | 1.27% | 1.78% |
| special_conditions_at_site | Oil or diesel | NA | NA | 0.11% |
| special_conditions_at_site | Mud | NA | NA | 0.20% |
| special_conditions_at_site | None | 96.00% | 98.73% | 97.83% |
| special_conditions_at_site | Auto traffic signal - out | 1.00% | 0.32% | 0.29% |
| special_conditions_at_site | Auto signal part defective | 0.20% | NA | 0.05% |
| special_conditions_at_site | Road sign or marking defective or obscured | 0.60% | 0.64% | 0.15% |
| special_conditions_at_site | Roadworks | 1.60% | 0.32% | 1.16% |
| special_conditions_at_site | Road surface defective | 0.60% | NA | 0.21% |
| speed_limit | 20 | 67.00% | 11.46% | 14.31% |
| speed_limit | 30 | 31.40% | 85.35% | 54.57% |
| speed_limit | 40 | 1.60% | 2.55% | 9.04% |
| speed_limit | 50 | NA | NA | 4.47% |
| speed_limit | 60 | NA | 0.32% | 11.87% |
| speed_limit | 70 | NA | 0.32% | 5.75% |
| urban_or_rural_area | Urban | 98.00% | 95.86% | 66.85% |
| urban_or_rural_area | Rural | 2.00% | 4.14% | 33.15% |
| weather_conditions | Fine no high winds | 85.20% | 77.39% | 84.80% |
| weather_conditions | Raining no high winds | 8.80% | 13.38% | 9.59% |
| weather_conditions | Snowing no high winds | NA | NA | 0.27% |
| weather_conditions | Fine + high winds | 1.00% | 1.59% | 0.92% |
| weather_conditions | Raining + high winds | 0.40% | 2.23% | 0.84% |
| weather_conditions | Snowing + high winds | NA | NA | 0.05% |
| weather_conditions | Fog or mist | 0.20% | NA | 0.54% |
| weather_conditions | Other | 4.40% | 5.41% | 2.99% |
Code
library(dplyr)
library(tidyr)
library(gt)

# Define UTLA codes
utla_codes <- c("E09000020", "E06000009")

# Function to calculate proportions for a given characteristic
calculate_proportions <- function(dataset, utla_code, characteristic) {
  characteristic_sym <- rlang::sym(characteristic)
  dataset %>%
    filter(UTLA20CD == utla_code) %>%
    count(!!characteristic_sym) %>%
    mutate(
      proportion = n / sum(n),
      UTLA20CD = utla_code,
      characteristic = characteristic
    ) %>%
    rename(Category = !!characteristic_sym) %>%
    # Store categories as text so characteristics of different types can be stacked
    mutate(Category = as.character(Category))
}

# Apply the function to each characteristic for each UTLA code
proportions_list <- list()
characteristics <- c(
  "urban_or_rural_area", "road_type", "weather_conditions",
  "road_surface_conditions", "speed_limit"
)
for (code in utla_codes) {
  for (char in characteristics) {
    prop_data <- calculate_proportions(q1_clean, code, char)
    proportions_list[[length(proportions_list) + 1]] <- prop_data
  }
}

# Combine all proportions into one dataframe
combined_data <- bind_rows(proportions_list)

# Example UTLA code to name mapping (placeholder names)
utla_names <- data.frame(
  UTLA20CD = c("E09000020", "E06000009"),
  UTLA_Name = c("x", "y")
)

# Add a new column to combined_data for UTLA names
combined_data <- combined_data %>%
  mutate(UTLA_Name = case_when(
    UTLA20CD == "E09000020" ~ "UTLA Name 1",
    UTLA20CD == "E06000009" ~ "UTLA Name 2"
  ))

# Pivot the data: one column per UTLA, formatted as percentages
combined_data_wide <- combined_data %>%
  pivot_wider(
    names_from = UTLA_Name,
    values_from = proportion,
    id_cols = c(characteristic, Category)
  ) %>%
  mutate(across(starts_with("UTLA Name"), ~ scales::percent(.x, accuracy = 0.01)))

# Create the GT table
gt_table <- combined_data_wide %>%
  gt() %>%
  tab_header(title = "Comparison of Accident Characteristics between Two UTLAs") %>%
  cols_label(
    `UTLA Name 1` = "UTLA Name 1 Proportion (%)",
    `UTLA Name 2` = "UTLA Name 2 Proportion (%)",
    Category = "Category"
  )

# Print the table
print(gt_table)
4.1.3 Logistic Regression Results:
Expand to learn more about Logistic Regression
Brief Overview of Logistic Regression: Logistic regression is a statistical method used for modeling the relationship between a binary dependent variable and one or more independent variables. It’s particularly useful for understanding how different factors contribute to the probability of a certain event occurring. In logistic regression, we estimate the odds of the dependent variable being in one category versus another, based on the independent variables. The output is in the form of odds ratios, which indicate how the likelihood of the outcome changes with a unit change in the independent variable.
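To make the overview above concrete, here is a minimal sketch of a logistic regression in base R on simulated data (not our accident data). The variable names `area` and `severe`, the simulated effect size, and the use of Wald confidence intervals are all assumptions made for illustration; the intervals in our results table may have been computed differently.

```r
set.seed(42)  # reproducible simulated data (not the report's data)

n <- 2000
sim <- data.frame(
  area = factor(sample(c("Urban", "Rural"), n, replace = TRUE),
                levels = c("Urban", "Rural"))  # first level is the reference
)

# Simulate a binary outcome whose true odds ratio for Rural vs. Urban is 1.5
log_odds <- -1.5 + log(1.5) * (sim$area == "Rural")
sim$severe <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit the logistic regression: severe (1) vs. slight (0) accident
fit <- glm(severe ~ area, data = sim, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))

# Wald 95% confidence intervals, also exponentiated
ci <- exp(confint.default(fit))

cbind(OR = odds_ratios, ci, p_value = coef(summary(fit))[, "Pr(>|z|)"])
```

The row `areaRural` is read relative to the reference level `Urban`: an odds ratio above 1 means higher odds of a severe accident in rural areas, and a confidence interval that excludes 1 corresponds to a statistically significant effect at the 5% level.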
Continuing from the groundwork laid by our spatial mapping and characteristic proportion analysis, we further refined our understanding of spatial characteristics and their impact on severity through the use of a logistic regression analysis. This method enabled us to assess the probability of severe accidents, as opposed to minor ones, across a range of conditions and spatial characteristics.
We visualized our findings in a forest plot depicting only the statistically significant characteristics. Each point in the plot denotes the odds ratio for a specific variable, with horizontal lines representing the 95% confidence intervals. A ratio of 1 indicates neutrality, above 1 suggests an increased likelihood of a more severe accident (zone in red), and below 1 indicates a reduced likelihood (zone in green). Variables to the right of the middle line are associated with increased odds of being involved in a serious or fatal accident rather than a slight one, and those to the left with decreased odds. We have also included the full logistic regression results; only coefficients that are significant at a p-value < 0.01 are in bold. This already gives us a first understanding of the association between spatial characteristics and severity. The results of this regression are discussed, in the form of a table, in section 4.1.4.
| Characteristic | N | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|---|
| speed_limit | 95,220 | | | |
| 20 | | — | — | |
| 30 | | 1.13 | 1.08, 1.19 | <0.001 |
| 40 | | 1.35 | 1.26, 1.45 | <0.001 |
| 50 | | 1.58 | 1.45, 1.73 | <0.001 |
| 60 | | 1.64 | 1.52, 1.76 | <0.001 |
| 70 | | 1.50 | 1.34, 1.67 | <0.001 |
| road_type | 95,220 | | | |
| One way street | | — | — | |
| Roundabout | | 0.78 | 0.69, 0.90 | <0.001 |
| Dual carriageway | | 1.05 | 0.93, 1.19 | 0.46 |
| Single carriageway | | 1.36 | 1.21, 1.53 | <0.001 |
| Slip road | | 0.89 | 0.75, 1.05 | 0.17 |
| light_conditions | 95,220 | | | |
| Daylight | | — | — | |
| Darkness - lights lit | | 1.24 | 1.19, 1.29 | <0.001 |
| Darkness - lights unlit | | 1.54 | 1.30, 1.81 | <0.001 |
| Darkness - no lighting | | 1.35 | 1.26, 1.44 | <0.001 |
| Darkness - lighting unknown | | 0.86 | 0.74, 0.99 | 0.033 |
| weather_conditions | 95,220 | | | |
| Fine no high winds | | — | — | |
| Raining no high winds | | 0.89 | 0.84, 0.95 | <0.001 |
| Snowing no high winds | | 0.76 | 0.52, 1.09 | 0.15 |
| Fine + high winds | | 1.20 | 1.03, 1.39 | 0.018 |
| Raining + high winds | | 0.83 | 0.70, 0.99 | 0.041 |
| Snowing + high winds | | 0.26 | 0.08, 0.67 | 0.013 |
| Fog or mist | | 0.95 | 0.78, 1.16 | 0.62 |
| Other | | 0.79 | 0.72, 0.87 | <0.001 |
| urban_or_rural_area | 95,220 | | | |
| Urban | | — | — | |
| Rural | | 1.18 | 1.13, 1.23 | <0.001 |
| first_road_class | 95,220 | | | |
| Unclassified | | — | — | |
| Motorway | | 0.79 | 0.70, 0.89 | <0.001 |
| A(M) | | 0.99 | 0.73, 1.31 | 0.94 |
| A | | 1.03 | 0.99, 1.06 | 0.18 |
| B | | 1.11 | 1.05, 1.16 | <0.001 |
| C | | 0.86 | 0.79, 0.93 | <0.001 |
| special_conditions_at_site | 95,220 | | | |
| None | | — | — | |
| Auto traffic signal - out | | 0.68 | 0.49, 0.93 | 0.020 |
| Auto signal part defective | | 0.52 | 0.20, 1.14 | 0.14 |
| Road sign or marking defective or obscured | | 0.96 | 0.65, 1.40 | 0.84 |
| Roadworks | | 0.86 | 0.74, 0.99 | 0.041 |
| Road surface defective | | 1.77 | 1.32, 2.36 | <0.001 |
| Oil or diesel | | 1.43 | 0.94, 2.14 | 0.087 |
| Mud | | 0.71 | 0.50, 0.99 | 0.048 |
| road_surface_conditions | 95,220 | | | |
| Dry | | — | — | |
| Wet or damp | | 1.00 | 0.96, 1.05 | 0.91 |
| Snow | | 0.95 | 0.61, 1.47 | 0.83 |
| Frost or ice | | 0.75 | 0.66, 0.85 | <0.001 |
| Flood over 3cm. deep | | 0.65 | 0.40, 1.03 | 0.079 |

¹ OR = Odds Ratio, CI = Confidence Interval
4.1.4 Summary of our Proportion Table and Logistic Regression
We now offer a cohesive summary that merges insights from the Spatial Characteristic Proportion Table with those from our logistic regression analysis, providing a thorough overview. While this synthesis aligns some spatial characteristics with the regression results, it also reveals contradictions, as there are countless variables (not included in our project) that may have impacted accident occurrences and their severity. These findings encourage readers to develop their own hypotheses regarding the variations in accident frequency and severity, acknowledging that some aspects may extend beyond the scope of our current analysis.
Summary of our Proportion Table and Logistic Regression

Rural vs. Urban
Results from Proportion Table: Kensington reported 98% and Blackpool 95.86% of accidents in urban areas, both notably higher than the UK baseline of 66.85%.
Results from Logistic Regression: Rural areas were associated with increased odds of severe accidents, with an odds ratio of 1.18 relative to urban areas.

Road Types
Results from Proportion Table: Kensington exhibited a higher incidence of accidents on dual carriageways (19%) compared to the baseline (15.75%), possibly indicating issues related to safety features or increased traffic volume on these roads. Blackpool had a notably higher percentage of accidents on single carriageways (86.94%) compared to Kensington (64.20%) and the UK average (74.11%). A higher proportion of accidents in Blackpool occurred on “unclassified roads” (54.78%), compared to Kensington (16.60%) and the UK baseline (35.60%). This may suggest a lower level of road network organization and potentially less oversight or regulation by authorities in Blackpool.
Results from Logistic Regression: Motorways were associated with lower odds of severe accidents (OR 0.79, relative to unclassified roads), whereas single carriageways were associated with higher odds (OR 1.36, relative to one-way streets). As mentioned previously, motorways offer increased safety measures to protect from oncoming traffic, whereas single carriageways are associated with higher speed limits and limited safety measures.

Speed Limits
Results from Proportion Table: In Kensington, a substantial proportion of accidents occurred in 20 mph zones (67%), possibly indicative of traffic calming measures in place within a dense urban environment. In contrast, Blackpool saw a much higher percentage of accidents in 30 mph zones (85.35%) compared to Kensington (31.40%) and the baseline (54.57%).
Results from Logistic Regression: Higher speed limits generally correlated with increased odds of severe accidents. However, the odds ratio was lower at 70 mph (OR 1.50) than at 60 mph (OR 1.64). Since 70 mph is typical of dual carriageways and motorways, which offer safety barriers, this may reflect safer road designs and more careful driving at higher speeds. Overall, this is consistent with the idea that locations whose roads have higher speed limits could be more prone to severe road accidents, whereas locations with lower speed limits could see less severe accidents.

Light Conditions
Results from Proportion Table: Both Kensington and Blackpool followed the UK baseline trend, with the majority of their accidents occurring during daylight hours.
Results from Logistic Regression: Darkness raised the odds of severe accidents: areas with and without street lighting both showed higher odds of more severe accidents than daylight conditions. However, locations with lit streetlights (OR 1.24) still showed lower odds of more serious accidents than locations with unlit streetlights (OR 1.54) or no lighting at all (OR 1.35). We therefore conclude that locations with poorly lit or unlit roads tend toward more severe accidents than locations with well-lit roads.

Road Conditions
Results from Proportion Table: Special conditions at the accident site did not vary notably among Kensington, Blackpool, and the UK baseline.
Results from Logistic Regression: Sites with a defective road surface showed increased odds of severe accidents compared to slight ones (OR 1.77), underscoring the importance of properly indicating road maintenance issues. This finding suggests that locations with worse road conditions might experience more severe accidents than locations with roads in good condition.
Note
Unclassified Roads (UCRs): Unclassified roads are typically minor roads, lanes, or tracks that are not designated as A or B roads. They are often rural or local roads that may have limited traffic and may not be paved or well-maintained.
4.1.5 Key Findings on Spatial Characteristics in Road Accidents
In our spatial analysis, we identified spatial locations that were most prone to accidents and severe accidents. In this section, we provide a summary of the key findings that answer our research question.
Identification of Prone Locations:
· The City of London and Westminster emerged as regions with the highest frequency of both regular and severe accidents. This high incidence is likely influenced by their unique urban dynamics, which may include heavy pedestrian and vehicle traffic, a large influx of daily commuters and tourists.
· Kensington and Chelsea were identified as areas with a high frequency of accidents, although these tended to be less severe.
· Blackpool stood out for its high severity index, indicating a propensity for more severe accidents.
Spatial Characteristics:
· Blackpool: High frequency of accidents in urban areas (95.86%), with many occurring on single carriageways (86.94%) and a notable number on unclassified roads (54.78%). A large portion of these accidents happened in 30 mph zones (85.35%).
Note
These characteristics are independent observations. An accident in Blackpool might occur under any of these conditions, and they are not always mutually inclusive (e.g. an accident might happen in an urban area at 30 mph or in a rural setting at the same speed limit)
· Kensington: Significant proportion of accidents were urban (98%), with a high number on dual carriageways (19%) and predominantly occurring in 20 mph zones (67%).
Note
Important note: Similar to Blackpool, these characteristics are individual observations. They present possible scenarios of accidents but do not imply that all accidents in Kensington share these exact characteristics. An accident could occur in a different setting or under varying conditions.
4.2 Research Question 2: What temporal trends can we identify in road accidents in the UK? Can we identify the most dangerous times to be on the road?
As stated previously, our initial objective was to gain a fundamental understanding of temporal aspects in accidents through our exploratory analysis. This initial phase provided us with valuable intuitions about the temporal factors that warranted further investigation.
To answer this research question, we took a more in-depth approach, following this methodology:
We started by visualizing the presence of any noticeable trends and their potential correlation with accident severity.
After the visual assessment, we proceeded to statistically validate these identified trends.
Finally, we conducted a logistic regression analysis to determine how temporal variables relate to the likelihood of being involved in a severe accident.
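The statistical validation step could, for example, take the form of a chi-squared test of independence between time range and severity. This is only one plausible choice, shown here as an assumption rather than the test actually used, and the counts below are invented for illustration.

```r
# Hypothetical counts of accidents by time range and severity (invented)
counts <- matrix(
  c(40,  400, 1200,   # early morning
    50,  900, 4200,   # morning
    70, 1500, 7000,   # afternoon
    60, 1100, 4500),  # night
  nrow = 4, byrow = TRUE,
  dimnames = list(
    time_range = c("Early morning", "Morning", "Afternoon", "Night"),
    severity   = c("Fatal", "Serious", "Slight")
  )
)

# Share of each severity level within a time range (each row sums to 1)
round(prop.table(counts, margin = 1), 3)

# Chi-squared test of independence between time range and severity
test <- chisq.test(counts)
test$statistic
test$p.value < 0.05  # TRUE would indicate an association at the 5% level
```

A small p-value would only indicate that severity is not distributed identically across time ranges; the logistic regression in the final step is what quantifies the direction and size of each effect.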
4.2.1 Time of Day
In our analysis, we revisited the insights obtained during our EDA. Initially, we identified two significant ‘peaks’ in accident occurrences: one between 7 a.m. and 8 a.m. and another from 3 p.m. to 6 p.m. We therefore created four distinct time ranges, early morning (0-6 AM), morning (6 AM-12 PM), afternoon (12-6 PM), and night (6 PM-12 AM), to capture the distribution of accidents throughout the day and night.
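The four time ranges described above can be derived from the hour of each accident with `cut()`. The sketch below uses invented hours and assumes left-closed bins, so that an accident at exactly 6:00 counts as morning; how boundary hours were assigned in our data is not stated, so treat this as one possible convention.

```r
# Hypothetical accident hours (0-23), invented for illustration
hour <- c(0, 5, 6, 7, 11, 12, 15, 17, 18, 23)

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
time_range <- cut(
  hour,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("Early morning", "Morning", "Afternoon", "Night"),
  right = FALSE
)

# Number of accidents falling in each time range
table(time_range)
```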
We began by plotting a line chart of the average number of accidents in each time range. It echoed our earlier findings: accidents were most prevalent during the afternoon (12-6 PM), with an average of 118.732 accidents, well above the overall average of 69.166 accidents across all time ranges.
Code
# Grouping accidents by time ranges and severity to see the distribution
accidents_by_time_severity <- q2_clean %>%
  group_by(time_ranges, accident_severity_chr) %>%
  summarise(count = n(), .groups = 'drop') # Counting the number of accidents in each severity category for every time range

# Now, let's find out what proportion of each time range's accidents fall into each severity category
accidents_by_time_severity <- accidents_by_time_severity %>%
  group_by(time_ranges) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup() # This gives us the percentage of, say, fatal accidents out of all accidents in a given time range

# These next blocks are for in-line text in the report. We're specifically focusing on early morning data
early_morning_data <- accidents_by_time_severity %>%
  filter(time_ranges == "0-6 AM") %>%
  filter(accident_severity_chr %in% c("Fatal", "Serious")) # Focusing on the serious and fatal accidents in the early morning

fatal_proportion_early_morning <- early_morning_data %>%
  filter(accident_severity_chr == "Fatal") %>%
  pull(proportion) %>%
  first() # Grabbing the proportion value for 'Fatal'

serious_proportion_early_morning <- early_morning_data %>%
  filter(accident_severity_chr == "Serious") %>%
  pull(proportion) %>%
  first() # Getting the proportion value for 'Serious'

# Finally, let's visualize all this data
ggplot(accidents_by_time_severity, aes(x = time_ranges, y = proportion, fill = accident_severity_chr)) +
  geom_bar(stat = "identity", position = "fill") + # A stacked bar plot to show proportions of severity categories within each time range
  geom_text(aes(label = scales::percent(proportion, accuracy = 1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3) + # Adding text labels to our bars for clarity
  scale_fill_manual(
    values = c("Fatal" = "#FFEB00", "Serious" = "#BBBBBB", "Slight" = "#4C4E4D"),
    name = "Severity of Accident" # Custom colors for our bar plot
  ) +
  labs(
    x = "Time Range",
    y = "Proportion of Accidents",
    title = "Proportion of Accidents by Time Range per Severity" # Setting up the title and axis labels
  ) +
  theme_minimal() + # A clean, minimal theme for the plot
  theme(legend.position = "bottom") # Positioning the legend at the bottom
Then, we aimed to go beyond identifying the busiest hours for accidents and understand how time of day related to accident severity. We therefore created a proportional stacked bar graph (also known as a 100% stacked bar graph) depicting the proportion of accidents by severity within each time range. Using proportions was important because accidents were not evenly distributed across the time ranges: a direct comparison of raw counts per severity level would mostly have reflected how many accidents occurred in each time range rather than how severe they were.
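To illustrate why proportions matter here, the following sketch uses made-up counts (not the report's data) in which the afternoon has more serious accidents in absolute terms but a smaller share of them than the early morning.

```r
# Hypothetical counts: rows are time ranges, columns are severity levels
counts <- matrix(
  c(40, 260,  700,
    60, 900, 3540),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    time_range = c("0-6 AM", "12-6 PM"),
    severity   = c("Fatal", "Serious", "Slight")
  )
)

# Row-wise proportions: each time range sums to 1
proportions <- prop.table(counts, margin = 1)
round(proportions, 3)
```

In this toy table the afternoon records 900 serious accidents against 260 in the early morning, yet its serious share is lower, which is the kind of reversal a 100% stacked bar chart makes visible.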
From this chart, an intriguing observation emerged. The proportions of fatal (4%) and serious (26%) accidents were highest during the early morning (0-6 AM) time range. This was surprising because this period is typically associated with lower traffic volumes, yet it had a higher proportion of severe accidents. Visually, this suggested that the earliest hours of the day were associated with more severe accidents than later hours.
Chi-Squared Test Results

Severity   Chi_Squared   DF   P_Value
---------  ------------  ---  --------
Fatal      494           3    <0.01
Serious    4024          3    <0.01
Slight     13017         3    <0.01
To go one step further, we confirmed our findings using a statistical test. We chose the chi-square test of independence adjusted to proportions. Our method involved first calculating the overall proportion of each accident severity type (slight, serious and fatal) across all accidents in 2022. We then used these proportions to estimate the expected number of accidents of each severity level in each time range, and compared the observed counts against these expected counts. The test was run separately for each severity level, which is why each row of the table above has 3 degrees of freedom (four time ranges minus one).
Brief Overview of Chi-Square Test of Independence of Proportions: The chi-square test of independence is a statistical tool used to determine whether there is a significant association between two categorical variables. In this test, we compare observed proportions in different categories against expected proportions that would occur if the variables were independent of each other. Essentially, it answers the question: “Are the differences in proportions just due to chance, or do they reflect a true association between the variables?” The test calculates a chi-square statistic, which measures the discrepancies between observed and expected frequencies. A significant result suggests a noteworthy relationship between the variables being studied.
In Our Case: We applied this test to explore the relationship between the time of day and the severity of traffic accidents (categorized as slight, serious, and fatal). Our aim was to ascertain whether the proportion of accidents’ severities varied significantly across different times of the day, which could indicate a potential link between when an accident occurs and how severe it is.
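As a rough illustration of this kind of test, the sketch below compares hypothetical fatal-accident counts per time range against the shares we would expect if severity did not depend on the time of day. All numbers are invented for the example, and the use of `chisq.test()` with the `p` argument is our reading of the approach, not a reproduction of the exact computation behind the table above.

```r
# Hypothetical counts per time range (0-6 AM, 6-12 AM, 12-6 PM, 6-12 PM)
all_accidents   <- c(1000, 3000, 4500, 2500)  # all severities combined
fatal_accidents <- c(40, 35, 50, 45)          # fatal accidents only

# Expected share of fatal accidents per time range if severity did not depend on time
expected_share <- all_accidents / sum(all_accidents)

# Goodness-of-fit test: observed fatal counts against the expected shares
result <- chisq.test(fatal_accidents, p = expected_share)

result$statistic  # chi-squared statistic
result$parameter  # degrees of freedom: number of time ranges minus 1
result$p.value    # small values indicate the observed split departs from the expected one
```

With four time ranges the test has 3 degrees of freedom, matching the DF column reported for the time-of-day tests.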
The chi-square tests supported this observation: the association between the time of day and accident severity was statistically significant for every severity level (all p-values < 0.01; see chi square results above).
To confirm these findings and dive deeper, we set out to understand whether the time range in which an accident occurred changed the odds of it being severe (serious or fatal). We therefore fitted a logistic regression with the time range as the predictor and a binary outcome: 0 representing a slight accident and 1 representing a serious or fatal accident.
Examining the logistic output above, we found that all coefficients were highly significant (with p-values <0.001). Our reference category, the 0-6 AM time range, has an odds ratio of 1 by construction. Notably, between 6-12 AM, the odds of being involved in a serious or fatal accident dropped to 0.66 times those of the reference period. The odds ratio then rose again towards the end of the day (12-6 PM: 0.7; 6-12 PM: 0.75) while remaining below 1. The statistical significance of these values is also reflected in their confidence intervals (CI), none of which included 1 (e.g., 6-12 AM CI: 0.71 - 0.8).
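For readers unfamiliar with how such odds ratios are obtained, here is a minimal sketch on simulated data. The sample size and the probabilities in `p_severe` are assumptions chosen for illustration and are not estimates from the report; only the structure (a binary outcome regressed on a time-range factor with 0-6 AM as reference) mirrors the text.

```r
set.seed(42)  # simulated data, for illustration only

n <- 4000
time_range <- factor(
  sample(c("0-6 AM", "6-12 AM", "12-6 PM", "6-12 PM"), n, replace = TRUE),
  levels = c("0-6 AM", "6-12 AM", "12-6 PM", "6-12 PM")  # first level is the reference
)

# Assumed probabilities of a serious or fatal outcome in each time range
p_severe <- c("0-6 AM" = 0.30, "6-12 AM" = 0.22, "12-6 PM" = 0.23, "6-12 PM" = 0.24)
severe <- rbinom(n, size = 1, prob = p_severe[as.character(time_range)])

# Binary logistic regression: 0 = slight, 1 = serious or fatal
model <- glm(severe ~ time_range, family = binomial)

# Exponentiated slope coefficients are odds ratios relative to the reference level;
# the exponentiated intercept is the baseline odds in the reference level, not an odds ratio
odds_ratios <- exp(coef(model))
conf_int    <- exp(confint.default(model))  # Wald 95% confidence intervals

round(cbind(OR = odds_ratios, conf_int), 2)
```

An odds ratio below 1 whose confidence interval excludes 1 indicates significantly lower odds of a severe outcome than in the reference time range.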
Combining our findings from our visual, statistical and regression analysis, it became clear that while the afternoon (12-6 PM) time range had the highest average number of accidents, accidents occurring in the early morning (0-6 AM), though fewer in number, were more likely to be severe. This discrepancy could be attributed to several factors, such as driver fatigue, alcohol abuse, or compromised visibility (as found in our spatial analysis), which are more prevalent during late-night to early-morning hours.
While our findings demonstrated robust associations, it’s essential to acknowledge that our analysis focused solely on time ranges. Consequently, we cannot exclude the possibility that other variables, not accounted for in our model, may have influenced accident severity during these specific time periods.
4.2.2 Day of the Week
During our exploratory data analysis, we noticed a consistent increase in the frequency of accidents across different days of the week. We decided to investigate this upward trend and whether there were statistically significant variations in accident severity based on the day of the week. (N.B: this subsequent analysis follows the same methodology that was laid out in the time range analysis)
Looking at the line chart above, which depicts the average number of accidents per day of the week, Monday stood out with a notably lower average number of accidents than the baseline average. Conversely, Saturday had the highest average, at 321.55 accidents. The chart showed a clear weekly pattern, with accidents gradually increasing from Monday onwards and then dropping sharply on Sunday.
Code
# here we are creating the count per day and severity
accidents_by_day_severity <- q2_clean %>%
  group_by(day_name, accident_severity_chr) %>%
  tally(name = "count")

# Calculate the total accidents per day
total_accidents_by_day <- accidents_by_day_severity %>%
  group_by(day_name) %>%
  summarise(total_count = sum(count))

# Join to get the total accidents per day alongside the count per severity
accidents_by_day_severity <- accidents_by_day_severity %>%
  left_join(total_accidents_by_day, by = "day_name")

# Calculate the proportion of each severity type per day
accidents_by_day_severity <- accidents_by_day_severity %>%
  mutate(proportion = count / total_count)

# Create the pivot table
severity_proportions_pivot_day <- accidents_by_day_severity %>%
  select(day_name, accident_severity_chr, proportion) %>%
  pivot_wider(names_from = accident_severity_chr, values_from = proportion)

# View the pivot table
severity_proportions_pivot_day
#> # A tibble: 7 x 4
#> # Groups:   day_name [7]
#>   day_name   Fatal Serious Slight
#>   <fct>      <dbl>   <dbl>  <dbl>
#> 1 Monday    0.0228   0.252  0.725
#> 2 Tuesday   0.0138   0.219  0.767
#> 3 Wednesday 0.0130   0.214  0.773
#> 4 Thursday  0.0128   0.213  0.774
#> 5 Friday    0.0136   0.225  0.762
#> 6 Saturday  0.0163   0.221  0.763
#> 7 Sunday    0.0197   0.243  0.737

# THE FOLLOWING TWO CODE BLOCKS REFER TO CREATING VARIABLES FOR INLINE TEXT CODING - PLEASE DISREGARD FOR CALCULATIONS
monday_proportions <- severity_proportions_pivot_day %>%
  filter(day_name == "Monday") %>%
  summarise(Fatal = Fatal, Serious = Serious)

sunday_proportions <- severity_proportions_pivot_day %>%
  filter(day_name == "Sunday") %>%
  summarise(Fatal = Fatal, Serious = Serious)

ggplot(accidents_by_day_severity, aes(x = day_name, y = proportion, fill = accident_severity_chr)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(proportion, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3) +
  scale_fill_manual(
    values = c(Fatal = "#FFEB00", Serious = "#BBBBBB", Slight = "#4C4E4D"),
    name = "Severity of Accident"
  ) +
  labs(
    x = "Day of Week",
    y = "Proportion of Accidents",
    title = "Proportion of Accidents by DOW per Severity"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom",
        panel.background = element_rect(fill = "#f1f3f2", colour = "#f1f3f2"))
Next, continuing our analysis, we set out to determine if the day of the week had any association with the level of severity of its accidents. Interestingly, despite not having the highest accident frequency, Mondays exhibited the highest proportions of fatal (2.3%) and serious (25.2%) accidents, while Sundays followed closely with a relatively similar proportion of fatal (2.0%) and serious (24.3%) accidents.
This interesting discovery suggested that even though Saturdays had the highest number of accidents during the week, the severity of these accidents was generally lower compared to those occurring on Mondays. Conversely, despite the lower accident count on Mondays, a disproportionately higher percentage of these incidents resulted in more severe injuries when compared to other days of the week.
Chi-Squared Test Results for the Day of the Week

Severity   Chi_Squared   DF   P_Value
---------  ------------  ---  --------
Fatal      73.5          6    <0.01
Serious    77.1          6    <0.01
Slight     35.7          6    <0.01
Continuing with our analysis, we validated our findings by conducting a Chi-Squared test of independence based on proportions. This test affirmed a robust and significant association between the day of the week and the severity of accidents that occurred on those days as all p-values for the three levels of severity were below the 1% significance threshold (see chi results above).
Applying the same methodology, we conducted another logistic regression analysis, using “slight” accidents as the baseline (0) and designating serious/fatal accidents as the event of interest (1). This not only reaffirmed our earlier findings but also unveiled additional information, as presented in the regression table below.
Monday (the most dangerous day of the week) was designated as the reference day, with an odds ratio of 1 by construction, serving as the baseline for comparison with all other days. Relative to Monday, every other day of the week exhibited reduced odds of a serious or fatal accident, with odds ratios (ORs) consistently below 1. These associations were statistically significant at the 1% level, except for Sunday, which had an OR of 0.94 and a p-value of 0.032, suggesting a weaker association that is still significant at the 5% threshold. These results confirm our temporal findings that Monday is the day of the week most strongly associated with severe road accidents.
4.2.3 Month of Year
Drawing upon the same methodology used for analyzing day-of-the-week and time-of-day patterns, we then turned our focus to temporal trends related to months, while also incorporating the factor of accident severity into our analysis.
Code
accidents_by_month <- q2_clean %>%
  group_by(month_name) %>%
  summarise(Total_Number_of_Accidents = n()) %>%
  ungroup()

overall_avg_month <- mean(accidents_by_month$Total_Number_of_Accidents)

# Determine the upper limit for the y-axis
upper_limit_month <- max(accidents_by_month$Total_Number_of_Accidents) +
  max(accidents_by_month$Total_Number_of_Accidents) * 0.05

# this code is for inline text code - please disregard for non calculation purposes
november_accidents <- accidents_by_month$Total_Number_of_Accidents[accidents_by_month$month_name == "November"]

# Create the line plot with improved data point visibility, adjusted y-axis limits, and average line
ggplot(accidents_by_month, aes(x = month_name, y = Total_Number_of_Accidents, group = 1)) +
  geom_line(color = "#4C4E4D") + # Line plot with specified color
  geom_point(color = "#4C4E4D", size = 3) + # Data points with specified color and increased size
  # geom_text() does not accept label.padding, label.size, label.r or fill
  # (these belong to geom_label()), so they are omitted to avoid the
  # "Ignoring unknown parameters" warning; the plot itself is unchanged
  geom_text(aes(label = Total_Number_of_Accidents),
            vjust = -1.5, size = 3, color = "#4C4E4D", hjust = 0.5) +
  geom_hline(yintercept = overall_avg_month, linetype = "dashed", color = "red", size = 1) + # Overall average line
  labs(
    x = "Month of the Year",
    y = "Total Number of Accidents",
    title = "Total Number of Accidents by Month of the Year"
  ) +
  theme_minimal() +
  theme(
    panel.background = element_rect(fill = "#f1f3f2", colour = "#f1f3f2"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  ylim(7000, upper_limit_month) # Set y-axis limits
Our initial observation from this line chart above, illustrating the number of accidents per month of the year, was a significant peak in November, which notably exceeded the average monthly count with 9061 accidents compared to an overall monthly average of 8415. We also observed a period of high volatility in accidents between January and April, followed by a consistent increase from August to November.
This plot showed that September, October, and November had the highest numbers of accidents in 2022. Since our detailed analysis covers 2022 only, it would be interesting to investigate whether this pattern holds across different years, which would provide further evidence that November is typically the month with the most accidents.
Then, we plotted a bar chart displaying the proportion of each accident severity for each month. The primary insight from this chart was the relative constancy of the severity distribution, despite the large fluctuations in the monthly accident totals. This suggested that the variables influencing the frequency of accidents might not be the same as those affecting their severity. We nevertheless observed a slightly higher proportion of serious accidents in July and August.
Code
chi_sq_results_month <- data.frame()

# Overall share of each severity level across the year
overall_proportions <- accidents_by_month_severity %>%
  group_by(accident_severity_chr) %>%
  summarise(overall_count = sum(count), .groups = 'drop') %>%
  mutate(overall_proportion = overall_count / sum(overall_count))

# Total number of accidents in each month
total_accidents_by_month <- accidents_by_month_severity %>%
  group_by(month_name) %>%
  summarise(total_count = sum(count)) %>%
  ungroup()

severity_types <- c("Fatal", "Serious", "Slight")

for (severity in severity_types) {
  # Observed monthly counts for this severity level (0 where a month has none)
  severity_data <- accidents_by_month_severity %>%
    filter(accident_severity_chr == severity) %>%
    select(month_name, count)
  severity_table_complete <- merge(total_accidents_by_month[, c("month_name")], severity_data, by = "month_name", all.x = TRUE)
  severity_table_complete[is.na(severity_table_complete)] <- 0

  # Expected monthly counts if this severity followed the overall monthly distribution
  expected_proportion <- overall_proportions$overall_proportion[overall_proportions$accident_severity_chr == severity]
  expected_counts <- total_accidents_by_month %>%
    mutate(expected_count = total_count * expected_proportion)

  chi_squared_test <- chisq.test(severity_table_complete$count,
                                 p = expected_counts$expected_count / sum(expected_counts$expected_count))

  chi_sq_results_month <- rbind(chi_sq_results_month,
                                data.frame(Severity = severity,
                                           Chi_Squared = chi_squared_test$statistic,
                                           DF = chi_squared_test$parameter,
                                           P_Value = chi_squared_test$p.value))
}

# Label p-values below 0.01 as "<0.01" before rounding, so that values such as 0.007 are not displayed as 0.01
p_raw <- chi_sq_results_month$P_Value
chi_sq_results_month$P_Value <- ifelse(p_raw < 0.01, "<0.01", as.character(round(p_raw, 2)))

# Render the table; row.names = FALSE drops the "X-squared" row names inherited from the test statistic
knitr::kable(chi_sq_results_month, format = "simple", row.names = FALSE,
             caption = "Chi-Squared Test Results for Each Severity Type Across Months")
Chi-Squared Test Results for Each Severity Type Across Months

Severity   Chi_Squared   DF   P_Value
---------  ------------  ---  --------
Fatal      16.4          11   0.13
Serious    196.2         11   <0.01
Slight     342.2         11   <0.01
Our next step involved conducting a chi-squared test to statistically validate these findings. The results, shown in the table above, confirmed a significant relationship between the month of the year and the number of slight (Chi-Squared = 342.2, DF = 11, P-Value < 0.01) and serious (Chi-Squared = 196.2, DF = 11, P-Value < 0.01) accidents. Intriguingly, no such association was found for fatal accidents, whose p-value of 0.13 (Chi-Squared = 16.4, DF = 11) is well above the 1% significance threshold.
This observation is indeed plausible, as fatal accidents may have been influenced by factors that remained relatively stable throughout the year, showing little monthly variation. To elaborate, factors such as weather conditions, which could have varied by month, might have had a more significant impact on determining whether an accident resulted in a serious or slight outcome, while potentially having less influence on the likelihood of a fatality.
Interestingly, this finding aligned with our analysis of spatial characteristics, which revealed that adverse weather conditions such as rain and ice tended to decrease the odds of serious and fatal accidents. This could be attributed to the heightened caution exercised by drivers in such conditions.
Characteristic   N        OR1    95% CI1      p-value
---------------  -------  -----  -----------  --------
month_name       99,387
  January                 —      —
  February                1.02   0.94, 1.10   0.68
  March                   1.04   0.96, 1.12   0.32
  April                   1.11   1.03, 1.19   0.008
  May                     1.09   1.02, 1.18   0.015
  June                    1.06   0.99, 1.14   0.088
  July                    1.17   1.09, 1.26   <0.001
  August                  1.15   1.07, 1.23   <0.001
  September               1.08   1.01, 1.16   0.032
  October                 1.07   1.00, 1.15   0.064
  November                1.05   0.98, 1.13   0.16
  December                1.01   0.94, 1.09   0.77

1 OR = Odds Ratio, CI = Confidence Interval
To refine the analysis once again, logistic regression was applied, focusing on slight and serious accidents, given the statistically insignificant findings regarding fatal accidents. The regression aimed to discern the risk levels of accident severity for each month. The findings were nuanced; not all months exhibited a statistically significant variance in the likelihood of serious versus slight accidents. However, certain months stood out, in line with our observations from the previous chart.
During the transition from spring to summer, beginning in April, there was a gradual uptick in the likelihood of serious accidents. This trend achieved statistical significance in April (OR = 1.11, p = 0.008) and May (OR = 1.09, p = 0.015), potentially linked to increased travel frequency due to better weather.

The summer months, particularly July (OR = 1.17, p < 0.001) and August (OR = 1.15, p < 0.001), showed the highest odds ratios for serious accidents, both statistically significant. The increased risk during these months might be attributed to the summer holidays, which typically result in greater traffic volume and a more diverse range of drivers, including tourists, potentially raising the chance of serious incidents.

September, marking the end of summer, also showed a rise in risk (OR = 1.08, p = 0.032), possibly related to the resurgence of regular traffic patterns and the commencement of the school term.

In contrast, months such as February, March, and December did not show significant shifts in the odds of accident severity. This could be due to a variety of reasons, such as consistent driving behavior, stable road conditions, or uniform traffic volumes during these periods. The lack of significant findings in some months suggests that factors other than the time of year might have been more influential in determining accident severity. For example, road safety campaigns, law enforcement activities, or amendments to driving legislation could have exerted a more uniform influence on the severity of accidents, overshadowing any seasonal effects.
4.2.4 Key Findings on Temporal Patterns in Road Accidents
In our analysis, we identified distinct patterns related to the time of day, day of the week, and month of the year. In this section, we provide a summary of the key findings that answer our research question.
Time of Day:
· Accidents were most frequent during the afternoon hours (12-6 PM).
· Early morning hours (0-6 AM) exhibited a higher proportion of severe accidents (fatal and serious), possibly due to factors like driver fatigue, alcohol use, or compromised visibility.
· Statistical tests and logistic regression analysis confirmed a strong association between the time of day and accident severity.
Day of the Week:
· Saturdays recorded the highest average number of accidents, while Mondays had the lowest.
· Despite lower accident frequency, Mondays and Sundays exhibited the highest proportions of fatal and serious accidents.
· Statistical tests and logistic regression analysis confirmed a significant association between the day of the week and accident severity, with Mondays standing out as the riskiest day for road accidents in terms of severity.
Month of the Year:
· November experienced the highest number of accidents, with a significant peak.
· Proportions of accident severity varied by month, with July and August having a slightly higher proportion of serious accidents.
· Statistical tests confirmed a significant relationship between slight and serious accidents and the month of the year.
· Logistic regression revealed that the risk of serious accidents increased from April to August.
4.3 Research Question 3: Do demographics and vehicle characteristics affect road accidents and their severity?
Building upon the methodology employed to address our earlier research question, our exploratory analysis provided valuable insights into variables deserving further examination in this section. In this analysis, we introduced a more comprehensive approach, offering a nuanced perspective on how accidents relate to demographic attributes and vehicle types, while also exploring the persistence of these findings when evaluating accident severity. In order to avoid any potential biases when analyzing the demographic associations to road accidents, this section focuses strictly on drivers of vehicles.
4.3.1 Gender
Code
gender_severity_proportions <- filtered_q3a_clean %>%
  group_by(sex_chr, casualty_severity_chr) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(sex_chr) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()

# Assuming gender_severity_proportions is already created
ggplot(gender_severity_proportions, aes(x = factor(sex_chr), y = proportion, fill = casualty_severity_chr)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(
    aes(label = scales::percent(proportion, accuracy = 0.01)),
    position = position_stack(vjust = 0.5), color = "black", size = 3
  ) +
  scale_fill_manual(values = c("Light" = "#FFEB00", "Serious" = "#BBBBBB", "Fatal" = "#4C4E4D"), name = "Severity") +
  labs(title = "Proportion of Casualty Severity by Gender", x = "Gender", y = "Proportion") +
  theme_minimal()
In our exploratory data analysis, we looked at the total number of accidents categorized by gender and severity. This served as our starting point for identifying any patterns or differences, and we noticed that men had a higher overall count of accidents across all levels of severity. The chart above displays the proportion of each severity category within each gender, which allowed us to compare the two groups while controlling for the imbalance in the number of accidents involving men and women. Among males, 77.7% of accidents were slight, 20.84% serious and 1.43% fatal; the serious and fatal shares were higher than those observed among females.
Chi-Squared Test Results for Gender Proportions by Severity Level

Severity Level   Chi-Squared   Degrees of Freedom   P-Value
---------------  ------------  -------------------  --------
Light            141           1                    <0.01
Serious          489           1                    <0.01
Fatal            114           1                    <0.01
To confirm these findings and account for differences in accident counts between males and females, we conducted a chi-squared test on these proportions to assess the statistical significance of these gender-based differences. Our test results demonstrated highly significant associations between accident severity and gender across all severity levels (see table in margin), underscoring a clear difference in accident frequencies between the two genders at all levels of severity (p-value <0.01).
Further quantitative analysis through logistic regression, with males as the reference group, revealed that females were significantly less likely to be involved in serious or fatal accidents. Specifically, the odds of a female driver being in such an accident were 0.55 times those of a male driver in the UK in 2022, that is, about 45% lower (OR: 0.55, 95% CI: [0.52, 0.57], p-value < 0.001).
This pattern may imply that males are either engaging in riskier driving behaviors or are exposed to more high-risk situations. In contrast, the predominance of minor accidents among females might reflect a more cautious driving style or different patterns of vehicle usage.
Our analysis aligns with Li et al.’s 1998 study, which found a higher incidence of severe and fatal accidents among men. While historical data suggested that women previously faced a greater risk of serious injury in accidents of comparable severity, recent trends indicate a reduction in this gender disparity, potentially due to changes in vehicle safety features and driving behaviors, as discussed by Brumbelow & Jermakian in 2022. This aligns with current discussions on the “gendered data gap” and its implications on vehicle safety design and assessment.
4.3.2 Age
Using a similar approach to that of our gender analysis, we examined the total accident counts for each age group during our exploratory data analysis. This initial examination revealed a peak in accidents among relatively young individuals, particularly those aged 18-29. To delve deeper into this trend, we assessed the proportion of slight, serious, and fatal accidents across different age groups in our analysis.
To achieve this, we utilized a kernel density plot to visualize the age distribution within the three accident severity levels. While the age distributions for slight and serious accidents resembled the right-skewed overall distribution of accidents, a distinctive observation emerged: fatal accidents remained relatively stable across various age groups (see blue square on chart). This suggests that, although the elderly experienced fewer accidents, they faced a higher risk of mortality in such accidents.
Characteristic    N        OR1    95% CI1      p-value
----------------  -------  -----  -----------  --------
age_of_casualty   69,076   1.01   1.01, 1.01   <0.001

1 OR = Odds Ratio, CI = Confidence Interval
The odds ratio for age was 1.008 (95% CI 1.01-1.01, p <0.001), indicating that with each additional year of age, the odds of being involved in a serious or fatal accident, as opposed to a slight one, increased by 0.845% (computed from the unrounded estimate). This small yet consistent increase highlighted a crucial aspect: as individuals age, their risk of being involved in more severe accidents rises slightly each year. These findings are in line with those of the European Commission's mobility and transport department, which found that older individuals tend to be more at risk of serious and fatal injuries, especially those over 75, who are 5x more at risk compared to the average across all ages (European Commission, n.d.).
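The arithmetic behind this interpretation can be sketched as follows. The per-year odds ratio of 1.00845 is an assumption back-calculated from the 0.845% figure quoted above (the table only shows the rounded value 1.01), and the 10- and 50-year horizons are chosen purely for illustration.

```r
# Per-year odds ratio implied by the 0.845% figure quoted in the text (assumed unrounded value)
or_per_year <- 1.00845

# Percentage change in odds for one additional year of age
pct_per_year <- (or_per_year - 1) * 100

# Odds ratios compound multiplicatively across years
or_10_years <- or_per_year^10
or_50_years <- or_per_year^50

round(c(pct_per_year = pct_per_year, or_10_years = or_10_years, or_50_years = or_50_years), 3)
```

Because the effect compounds, a per-year odds ratio that looks negligible can translate into a sizeable difference in odds between drivers several decades apart in age, under the model's assumption that the effect of age is log-linear.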
4.3.3 Vehicle Characteristics
After analyzing demographics, we sought to determine if certain types of vehicles were more prone to accidents of varying severity than others. To do this, we employed our established methodology, initially creating a bar chart to visualize the distribution of slight, serious, and fatal accidents across different vehicle categories, including cars, bicycles, motorcycles, and trucks.
The bar chart revealed that motorcycle accidents displayed a higher percentage of serious accidents (34.51%) and fatal accidents (2.18%) compared to all other vehicle categories. Conversely, truck accidents exhibited the highest proportion of slight accidents (90.44%) and the lowest incidence of serious accidents (8.61%). Interestingly, despite their vulnerability on the road, cyclists had the lowest percentage of fatal accidents (0.56%).
Following this, we conducted a logistic regression analysis to assess the likelihood of experiencing serious or fatal accidents across various vehicle categories, using cars as the reference group (odds ratio of 1 by construction).
The results showed that the odds of being involved in a serious or fatal accident were 2.496 times higher for cyclists, and 4.537 times higher for motorcyclists, than for car drivers. In contrast, trucks were associated with reduced odds, about 17.3% lower than those of cars. These findings held strong statistical significance, with p-values below 0.01, indicating a robust difference in the risk of accident severity based on vehicle type.
Combining the insights from the bar chart and logistic regression analysis, we concluded that while slight severity accidents predominated across all vehicle categories, motorcycles stood out with a significantly higher risk of serious or fatal accidents. The statistical analysis confirmed that motorcycles carried a substantially greater risk of serious outcomes when compared to cars, while cyclists also had elevated odds, though not as high as motorcycles. In contrast, trucks appeared to be associated with a reduced risk of serious or fatal accidents compared to cars.
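To make the link between the bar chart percentages and odds ratios concrete, the sketch below computes an unadjusted odds ratio directly from the shares quoted in this section. It compares motorcycles with trucks because both sets of percentages appear in the text; this is an illustrative back-of-the-envelope calculation, not the regression reported above, which uses cars as the reference group.

```r
# Share of serious or fatal accidents, taken from the percentages quoted above
p_motorcycle <- (34.51 + 2.18) / 100   # serious + fatal share for motorcycles
p_truck      <- 1 - 90.44 / 100        # everything that is not slight for trucks

# Odds = probability of the event divided by probability of no event
odds <- function(p) p / (1 - p)

# Unadjusted odds ratio of a serious or fatal outcome, motorcycles versus trucks
or_motorcycle_vs_truck <- odds(p_motorcycle) / odds(p_truck)
round(or_motorcycle_vs_truck, 2)
```

An odds ratio is a ratio of odds rather than of probabilities, so it is generally larger than the ratio of the two percentages when the outcome is not rare.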
4.4 Research Question 4: Can we predict the severity of a road accident?
Code
library(caret)

logit_data_clean <- na.omit(logit_data)

# Ensure categorical variables are factors
factor_cols <- c("month", "day", "road_type", "speed_limit", "light_conditions",
                 "weather_conditions", "road_surface_conditions", "urban_or_rural_area",
                 "driver_imd_decile", "special_conditions_at_site", "sex_of_driver")
logit_data_clean[factor_cols] <- lapply(logit_data_clean[factor_cols], factor)

# Splitting the data into training and testing sets
set.seed(123) # For reproducibility

# Stratified sampling based on 'accident_severity'
# NOTE: the split is drawn from logit_data, not logit_data_clean, so the NA removal
# and factor conversions above do not reach the training and testing sets.
trainIndex <- createDataPartition(logit_data$accident_severity, p = 0.8, list = FALSE)

# Creating training and testing sets
train_data <- logit_data[trainIndex, ]
test_data <- logit_data[-trainIndex, ]

# Checking the distribution in the training set
table(train_data$accident_severity)
# Checking the distribution in the testing set
table(test_data$accident_severity)

# Proportion in the training set
prop_train_slight <- table(train_data$accident_severity)["Slight"] / nrow(train_data)
prop_train_serious_fatal <- table(train_data$accident_severity)["Serious/Fatal"] / nrow(train_data)

# Proportion in the testing set
prop_test_slight <- table(test_data$accident_severity)["Slight"] / nrow(test_data)
prop_test_serious_fatal <- table(test_data$accident_severity)["Serious/Fatal"] / nrow(test_data)

# Print proportions
# cat("Proportion of 'Slight' in Training Set:", prop_train_slight, "\n")
# cat("Proportion of 'Serious/Fatal' in Training Set:", prop_train_serious_fatal, "\n")
# cat("Proportion of 'Slight' in Testing Set:", prop_test_slight, "\n")
# cat("Proportion of 'Serious/Fatal' in Testing Set:", prop_test_serious_fatal, "\n")

# Backward Selection with selected variables
full_model <- glm(accident_severity ~ month + day + road_type + speed_limit +
                    light_conditions + weather_conditions + road_surface_conditions +
                    urban_or_rural_area + engine_capacity_cc + special_conditions_at_site +
                    sex_of_driver + age_of_casualty + hour,
                  data = train_data, family = binomial())
backward_model <- step(full_model, direction = "backward", trace = 0)

# Forward Selection with selected variables
null_model <- glm(accident_severity ~ 1, data = train_data, family = binomial())
forward_model <- step(null_model,
                      scope = list(lower = null_model, upper = full_model),
                      direction = "forward", trace = 0)

# Choose the best model based on AIC
final_model <- if (AIC(backward_model) < AIC(forward_model)) {
  backward_model
} else {
  forward_model
}

# Predict on the testing set
predictions <- predict(final_model, test_data, type = "response")
predicted_class <- ifelse(predictions > 0.5, 'Non-Slight', 'Slight')

# Convert predictions to a factor with appropriate levels
predicted_class_factor <- factor(predicted_class, levels = c('Slight', 'Non-Slight'))

# Convert the test data accident_severity to a factor with the same levels
# NOTE: the proportion checks above use the labels 'Slight' and 'Serious/Fatal'. If those
# are the labels in accident_severity, recoding with levels c('Slight', 'Non-Slight') turns
# every 'Serious/Fatal' row into NA, and those rows are left out of the confusion matrix.
test_data$accident_severity_factor <- factor(test_data$accident_severity, levels = c('Slight', 'Non-Slight'))

# Evaluate the model with a confusion matrix
conf_matrix <- confusionMatrix(predicted_class_factor, test_data$accident_severity_factor)

# Print out the confusion matrix and overall statistics
print(conf_matrix)
print(conf_matrix$overall)
When we initially conceived our project, our objective was to create a model capable of predicting accident severity based on significant variables identified in our report. However, these significant variables were spread across multiple datasets, making it highly complex to effectively gather and integrate them.
Moreover, we encountered many other challenges.
An initial observation from our analysis revealed that serious and fatal accidents were far less frequent than slight accidents (which is why we conducted chi-squared tests using proportions to account for this disproportion in previous parts of our analysis).
This observation indicated that a model trained to predict the severity of an accident would be highly biased towards slight accidents over more serious ones. As a first step, we employed an 80/20 data split into training and test sets and implemented stratified sampling so that both sets preserved the same proportions of slight and serious/fatal accidents.
This left us with the following split:
Training vs Test Split
         Slight Accidents    Serious/Fatal Accidents
Train    23844               6877
Test     5961                1719
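As a quick check of the stratification, the class proportions can be recomputed from the counts reported in the split table. This is a minimal base R sketch; the object names are illustrative and are not part of our original code.

```r
# Counts taken from the training vs test split reported above.
split_counts <- matrix(
  c(23844, 6877,
    5961,  1719),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Train", "Test"), c("Slight", "Serious/Fatal"))
)

# Row-wise proportions: the share of each severity class within each set.
class_props <- prop.table(split_counts, margin = 1)
print(round(class_props, 4))

# Share of all observations that ended up in the training set.
train_share <- sum(split_counts["Train", ]) / sum(split_counts)
print(round(train_share, 3))
```

Both sets contain roughly 22.4% serious/fatal accidents. Stratification therefore preserves the imbalance in both sets rather than removing it.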
Subsequently, we trained the model using both forward and backward stepwise selection and obtained the following results:
Confusion Matrix
Prediction    Slight    Serious/Fatal
Slight        5942      0
Non-Slight    19        0
Despite our efforts to address the data imbalance issue, our model performed extremely poorly. While it achieved 99.7% accuracy (5,942 of the 5,961 evaluated cases), this metric proved misleading: the confusion matrix contains no actual Serious/Fatal cases, so the figure only reflects how often slight accidents were labelled as slight. With ‘Slight’ as the positive class, sensitivity was 99.7%, while the absence of any ‘Non-Slight’ (Serious/Fatal) reference cases rendered specificity incalculable. The 5,961 evaluated cases equal the number of slight accidents in the test set, which suggests that the 1,719 serious/fatal test cases were dropped during evaluation, possibly because the true labels (‘Serious/Fatal’) did not match the ‘Non-Slight’ label used for the predictions. A Kappa value of 0 indicated an agreement no better than random chance, raising concerns about the model’s predictive power, particularly for the minority class. Our model selection process, based on backward and forward selection using AIC, may have contributed to potential over-fitting. Additionally, it is possible that the selected variables did not adequately capture the complexity required for class differentiation. Consequently, exploring alternative techniques, model approaches, or metrics to enhance severity prediction accuracy might be necessary. Nevertheless, this might require additional expertise and resources beyond our current knowledge. We’ve left our code attempts hidden; please feel free to expand them if you’d like to see our process.
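To make the reported metrics concrete, the sketch below recomputes them by hand from the confusion matrix shown above, using base R only. It assumes that rows are predictions, columns are actual classes, and ‘Slight’ is the positive class; the object names are illustrative.

```r
# Confusion matrix as reported above (rows: prediction, columns: actual class).
cm <- matrix(
  c(5942, 19,   # actual Slight:        predicted Slight, predicted Non-Slight
    0,    0),   # actual Serious/Fatal: predicted Slight, predicted Non-Slight
  nrow = 2,
  dimnames = list(Prediction = c("Slight", "Non-Slight"),
                  Actual     = c("Slight", "Serious/Fatal"))
)

n <- sum(cm)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# Sensitivity: correctly predicted Slight / all actual Slight.
sensitivity <- cm["Slight", "Slight"] / sum(cm[, "Slight"])

# Specificity: correctly predicted non-slight / all actual Serious/Fatal.
# The denominator is 0 here, so the result is NaN (incalculable).
specificity <- cm["Non-Slight", "Serious/Fatal"] / sum(cm[, "Serious/Fatal"])

# Cohen's Kappa: observed agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa), 3))
```

Because every evaluated case belongs to a single actual class, the chance agreement equals the observed accuracy and Kappa collapses to 0, even though accuracy is close to 1.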
Code
library(caret)
library(ROSE)

# Remove missing values from the dataset
logit_data_clean <- na.omit(logit_data)

# Splitting the data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(logit_data_clean), 0.8 * nrow(logit_data_clean))
train_data <- logit_data_clean[train_index, ]
test_data <- logit_data_clean[-train_index, ]

# Apply resampling to balance the training data
balanced_data <- ovun.sample(accident_severity ~ ., data = train_data, method = "both", N = 200000)$data

# Backward Selection on the balanced dataset
full_model <- glm(accident_severity ~ ., data = balanced_data, family = binomial())
backward_model <- step(full_model, direction = "backward", trace = 0)

# Forward Selection on the balanced dataset
null_model <- glm(accident_severity ~ 1, data = balanced_data, family = binomial())
forward_model <- step(null_model,
                      scope = list(lower = null_model, upper = full_model),
                      direction = "forward", trace = 0)

# Choose the best model based on AIC (assuming backward_model is chosen for simplicity)
final_model <- backward_model

# Predict on the testing set
predictions <- predict(final_model, test_data, type = "response")
predicted_class <- ifelse(predictions > 0.5, 'Non-Slight', 'Slight')

# Convert predictions to a factor with appropriate levels
predicted_class_factor <- factor(predicted_class, levels = c('Slight', 'Non-Slight'))

# Convert the test data accident_severity to a factor with the same levels
# NOTE: any label other than 'Slight' or 'Non-Slight' (e.g. 'Serious/Fatal') becomes NA here.
test_data$accident_severity_factor <- factor(test_data$accident_severity, levels = c('Slight', 'Non-Slight'))

# Evaluate the model with a confusion matrix
conf_matrix <- confusionMatrix(predicted_class_factor, test_data$accident_severity_factor)

# Print out the confusion matrix and overall statistics
print(conf_matrix)
print(conf_matrix$overall)
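The resampling step above relies on the ROSE package. The underlying idea of random oversampling can be sketched in base R as follows. This is a simplified illustration on a small made-up training set, not a reimplementation of `ovun.sample`, and all object names and values are hypothetical.

```r
set.seed(42)

# Hypothetical, imbalanced training set (NOT the report's data).
train <- data.frame(
  accident_severity = factor(c(rep("Slight", 80), rep("Serious/Fatal", 20)),
                             levels = c("Slight", "Serious/Fatal")),
  speed_limit = c(rep(30, 80), rep(60, 20))
)

majority_rows <- which(train$accident_severity == "Slight")
minority_rows <- which(train$accident_severity == "Serious/Fatal")

# Draw minority rows with replacement until both classes have the same size.
n_extra <- length(majority_rows) - length(minority_rows)
extra_rows <- sample(minority_rows, n_extra, replace = TRUE)

balanced <- train[c(majority_rows, minority_rows, extra_rows), ]

print(table(balanced$accident_severity))
```

Only the training data should be resampled. The test set keeps its original class distribution so that the evaluation reflects real-world conditions.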
5. Conclusion
5.1 Take home message
Our comprehensive analysis of road accidents in the UK for the year 2022 provided vital insights into the multifaceted nature of road safety. Key takeaways from our study underscore the intricate interplay of spatial, temporal, demographic, and vehicle-related factors in influencing both the frequency and severity of road accidents. Specifically, our findings highlighted the heightened risk of severe accidents in certain urban areas like Blackpool, the pronounced vulnerability of motorcycles to serious accidents, and the increased likelihood of fatal accidents among older demographics. Temporally, the most hazardous times on the road emerged as early morning hours and Saturdays, revealing patterns that align with societal routines and behaviors. The study also revealed a notable gender disparity, with men being more likely than women to be involved in severe accidents.
Importantly, our research illustrated that numerous variables contribute to road accidents and their severity. This complexity underscores the vital importance of road users remaining vigilant at all times. It is imperative to drive only when well and alert, adhere strictly to speed limits, and always be mindful of the surrounding road conditions and traffic. Additionally, the importance of using safety features such as seat belts and adhering to traffic rules cannot be overstated. These measures, along with a heightened awareness of the factors identified in our study, can significantly mitigate the risks associated with road usage. Ultimately, the responsibility for road safety lies with each road user, and our collective adherence to safety practices can make a substantial difference in reducing accidents and enhancing overall road safety.
5.2 Limitations
The pursuit of this research encountered several methodological challenges that must be acknowledged.
First, working with three intricate datasets made it complex to consolidate the information pertaining to individual accidents. The datasets, while rich in data, were structured in a manner that made it challenging to align and retrieve comprehensive details for each accident event.
Secondly, in our spatial analysis, we normalized data using population metrics. However, upon reflection, traffic volume would have been a superior measure as it directly correlates with the potential for road incidents. The reliance on population data may have introduced discrepancies in the assessment of accident frequency and severity by region.
Thirdly, the scope of our research questions introduced a significant degree of complexity to our analysis. The breadth of each question was such that it could have merited a separate, dedicated study. This expansive approach occasionally constrained the depth of exploration into each specific domain of road safety.
Finally, while envisioning our project, we set out to create a model capable of predicting accident severity based on multiple variables. However, our model consistently predicted accidents as ‘slight’ in severity, due to a substantial imbalance in the distribution of accident severity within our dataset, which revealed the complexity of accurately predicting accident severity in such conditions. Despite our best efforts, we were unable to create a model that could successfully predict the severity of a road accident.
These limitations underscore the imperative for a judicious interpretation of our findings. They also pave the path for future research endeavors, which should consider the application of more precise and relevant data measures, such as traffic volume, and a narrower, more focused approach to research questions to provide deeper, more actionable insights into road safety.
5.3 Future work
Our report laid a solid foundation in addressing some critical questions related to road accidents in the UK. Moving forward, there are several avenues for future work that could enhance our understanding and contribute to road safety.
Firstly, exploring temporal trends beyond the initial investigation could provide valuable insights into the dynamics of road accidents. While our current report focused on the data from the year 2022, including data from multiple years would enable us to identify long-term trends and emerging patterns in road accidents, enhancing the general applicability of our findings and contributing to a more dynamic and adaptive approach to road safety strategies.
Furthermore, in future research, it would be valuable to find and evaluate datasets that are more precise about the road network. These could include factors such as the miles traveled in specific areas and regional regulations, allowing a more in-depth analysis of these regions to understand the objective factors that influence accident severity.
Additionally, a more detailed examination of demographics and vehicle characteristics could encompass factors such as the socioeconomic status and occupation of drivers, as well as the maintenance records and technological features of vehicles, to better understand their contributions to both the occurrence and severity of accidents.
Lastly, building on the groundwork, the development of a predictive model for accident severity could be refined and expanded, incorporating more variables and employing advanced machine-learning techniques to improve accuracy and applicability.
Bose, D., Segui-Gomez, M., & Crandall, J. R. (2011). Vulnerability of Female Drivers Involved in Motor Vehicle Crashes: An Analysis of US Population at Risk. American Journal of Public Health, 101(12), 2368–2373. https://doi.org/10.2105/ajph.2011.300275
Brumbelow, M. L., & Jermakian, J. S. (2022). Injury risks and crashworthiness benefits for females and males: Which differences are physiological? Traffic Injury Prevention, 23(1), 11–16. https://doi.org/10.1080/15389588.2021.2004312
Li, G., Baker, S. P., Langlois, J. A., & Kelen, G. D. (1998). Are Female Drivers Safer? An Application of the Decomposition Method. Epidemiology, 9(4), 379–384. https://doi.org/10.1097/00001648-199807000-00006